Structural Risk Minimization for C^{1,1}(\R^{d}) Regression

Structural Risk Minimization for C1,1(\Rd) Regression

Adam Gustafson adam.marc.gustafson@gmail.com Matthew Hirn mhirn@msu.edu Department of Mathematics and Department of Computational Mathematics, Science and Engineering, Michigan State University Kitty Mohammed kittymohammed1985@gmail.com Hariharan Narayanan hariharan.narayanan@tifr.res.in School of Technology and Computer Science, Tata Institute of Fundamental Research Jason Xu jqxu@ucla.edu Department of Biomathematics, University of California Los Angeles
Abstract

One means of fitting functions to high-dimensional data is by providing smoothness constraints. Recently, the following smooth function approximation problem was proposed by Herbert-Voss, Hirn, and McCollum (2017): given a finite set and a function , interpolate the given information with a function (the class of first-order differentiable functions with Lipschitz gradients) such that for all , and the value of is minimal. An algorithm is provided that constructs such an approximating function and estimates the optimal Lipschitz constant in the noiseless setting.

We address statistical aspects of reconstructing the approximating function from a closely-related class given samples from noisy data. We observe independent and identically distributed samples for , where is a noise term and the set is fixed and known. We obtain uniform bounds relating the empirical risk and true risk over the class , where the quantity grows with the number of samples at a rate governed by the metric entropy of the class . Finally, we provide an implementation using Vaidya’s algorithm, supporting our results via numerical experiments on simulated data.

1 Introduction

Regression tasks are prevalent throughout statistical learning theory and machine learning. Given samples in and corresponding values , a regression function learns a model for the data that best generalizes to new points . Absent any prior information on , the best regression function , as measured by the squared loss, is obtained by minimizing the empirical risk over a specified function class ,

 ˆf=arginff∈F1n∑a∈E|f(a)−y(a)|2,

subject to a regularization penalty. If is equipped with a norm or semi-norm , then the regularized risk can take either the form

 ˆf=arginff∈F1n∑a∈E|f(a)−y(a)|2+λ⋅Ω(∥f∥F), (1)

or

 ˆf=arginff∈F1n∑a∈E|f(a)−y(a)|2subject to∥f∥F≤M, (2)

where and are hyper-parameters, and is a monotonically increasing function.

In either case, the quality of is primarily determined by the functional class . Recently, numerous state of the art empirical results have been obtained by using neural networks, which generate a functional class through the architecture of the network. The class is also often taken as the span of a suitably defined dictionary of functions, or a reproducing kernel Hilbert space (RKHS) with an appropriate kernel. For example, when and is an RKHS, equation (1) leads to the popular kernel ridge regression scheme, which has a closed form solution that is simple to compute.

When , the smoothness of is determined by the dictionary , or if is a reproducing kernel Hilbert space, the regularity of is determined by the kernel. An alternate approach that does not require choice of dictionary or kernel is to specify the smoothness of directly, by taking or . (The former class contains the functions continuously differentiable up to order . The class of functions is similar, but it consists of the functions that are differentiable up to order , with the highest-order derivatives having a finite Lipschitz constant.) However, the computational complexity of minimizing the regularized risk over these spaces is generally prohibitive. An exception is the space , which consists of functions with finite Lipschitz constant, and for which several regression algorithms exist (von Luxburg and Bousquet, 2004; Beliakov, 2006; Gottlieb et al., 2013; Kyng et al., 2015).

In recent work, Herbert-Voss, Hirn, and McCollum (2017) provide an efficient algorithm for computing the interpolant that, given noiseless data , minimizes the Lipschitz constant of the gradient. In this paper we extend the methods of Herbert-Voss et al. (2017) to regularized risk optimizations of the form (2). In particular, we consider the noisy scenario in which the function to be reconstructed is not measured precisely on a finite subset, but instead is measured with some uncertainty.

An outline of this paper is as follows. In Section 1.1, we introduce the function interpolation problem considered by Herbert-Voss et al. (2017), and summarize the solution in the noiseless case in Section 1.2. Next, we consider the setting where the function is measured under uncertainty, and derive uniform sample complexity bounds on our estimator in Section 2.1. The resulting optimization problem can be solved using an algorithm due to Vaidya (1996); we provide details on computing the solution to the regularized risk in Sections 2.2 and 2.4. We implement the estimator and present reconstruction results on simulated data examples in Section 3, supporting our theoretical contributions, and close with a discussion.

1.1 Noiseless Function Interpolation Problem

Here we summarize the function approximation problem considered by Herbert-Voss et al. (2017). First, recall that the Lipschitz constant of an arbitrary function is defined as

 Lip(g):=supx,y∈\Rd,x≠y|g(x)−g(y)||x−y|, (3)

where denotes the standard Euclidean norm, and we note that in the sequel we use this definition on domains other than . Second, denote the gradient of such an arbitrary as . Finally, let be the class of -times continuously differentiable functions whose derivatives have a finite Lipschitz constant. In the function approximation problem, we are given a finite set of points such that , and a function specified on . The function approximation problem as stated by Herbert-Voss et al. (2017) is to compute an interpolating function that minimizes

 ∥f∥˙C1,1(\Rd):=inf{Lip(∇˜f)∣˜f(a)=f(a) for all a∈E}. (4)

The question of whether one can even reconstruct such an interpolating function was answered by Whitney (1934). Presume that we also have access to the gradients of on , and denote them . In the case of , the polynomials are defined by the specified function and gradient information:

 Pa(x)=f(a)+Daf⋅(x−a),a∈E,x∈\Rd.

Letting denote the space of first-order polynomials (i.e., affine functions), the map is known as a 1-field. For any , the first order Taylor expansions of are elements of , and are known as jets (Fefferman and Klartag, 2009), defined as:

 Jaf(x):=f(a)+∇f(a)⋅(x−a),a,x∈\Rd.

Whitney’s Extension Theorem for may be stated as follows:

Theorem 1 (Whitney’s Extension Theorem for ˙C1,1(\Rd)).

Let be closed and let be a -field with domain . If there exists a constant such that

1. for all , and

2. for all , and ,

then there exists an extension such that for all .

Given a finite set as in the function interpolation problem, these conditions are automatically satisfied. However, this theorem does not provide a solution for the minimal Lipschitz constant of . Le Gruyer (2009) provides a solution to both problems, which we discuss next.

1.2 Minimal Value of Lip(∇ˆf)

Herbert-Voss et al. (2017) define the following norm for when the first-order polynomials are known

 ∥P∥˙C1,1(E):=inf{Lip(∇˜f)∣Ja˜f=Pa for all a∈E}, (5)

and similarly define

 ∥f∥˙C1,1(E):=inf{Lip(∇˜f)∣˜f(a)=f(a) for all a∈E}, (6)

when the gradients are unknown, where in both cases the infimum is taken over functions .

Presuming we are given the -field , Le Gruyer (2009) defines the functional as:

 Γ1(P;E)=2supx∈\Rd(maxa,b∈E,a≠bPa(x)−Pb(x)|a−x|2+|b−x|2). (7)

Given only functions , Le Gruyer (2009) also defines the functional in terms of as

 Γ1(f;E)=inf{Γ1(P;E)∣Pa(a)=f(a) for all a∈E}, (8)

The following theorem is proven by Le Gruyer (2009), which shows that (7) and its equivalent formulation in (9) provides a solution for (5):

Theorem 2 (Le Gruyer).

Given a set and a -field ,

 Γ1(P;E)=∥P∥˙C1,1(E).

An equivalent formulation of (7) which is amenable to implementation is as follows. Consider the following functionals mapping :

 A(P;a,b) =(Pa(a)−Pb(a))+(Pa(b)−Pb(b))|a−b|2 =2(f(a)−f(b))+(Daf−Dbf)⋅(b−a)|a−b|2, B(P;a,b) =|∇Pa(a)−∇Pb(a)||a−b| =|Daf−Dbf||a−b|.

Proposition 2.2 of Le Gruyer (2009) states that

 Γ1(P;E)=maxa≠b∈E√A(P;a,b)2+B(P;a,b)2+|A(P;a,b)|, (9)

whence a naive implementation allows to be found in computations. Inspired by Fefferman and Klartag (2009), Herbert-Voss et al. (2017) also construct algorithms which will solve for the order of magnitude of in time, but we omit the details here. Additionally, as a consequence of the proof of Proposition 2.2 of Le Gruyer (2009), equation (7) may alternatively be written as

 Γ1(P;E)=2maxa,b∈E:a≠bsupx∈¯Bd(a+b2,|a−b|2)Pa(x)−Pb(x)|a−x|2+|b−x|2, (10)

where denotes the closed -dimensional Euclidean ball centered at with radius .

Recall that the gradients are typically not known in applications. As a corollary, we have the following convex optimization problem for finding (6), and the minimizing -field provides the gradients .

Corollary 3.

Given a set and a function ,

 Γ1(f;E)=∥f∥˙C1,1(E).

Recall that . The set and the values are fixed, so the optimization problem is to solve for the gradients that minimize .

2 ˙C1,1(E) Regression

In statistical applications where is observed with uncertainty, one often assumes that we observe , where , and is assumed to be independent and identically distributed Gaussian noise for each . Since both the function values and the gradients are unknown, we minimize an empirical squared error loss over the variables defining the -field. Given a bound on the seminorm of the unknown -field, regression entails solving an optimization problem of the form

 minP 1n∑a∈E(y(a)−Pa(a))2 (11) s.t. ∥P∥˙C1,1(E)≤M.

This is a convex optimization problem: the objective function of the empirical squared error loss in (11) is convex, as is the constraint set since it is a ball specified by a seminorm. This section proceeds as follows: we begin by analyzing the sample complexity of the function class. These risk bounds establish almost sure convergence of the empirical risk minimizer, and guides the choice of . Given , we next appeal to Vaidya’s algorithm to solve the resulting optimization problem (11). We then apply the efficient algorithm of Herbert-Voss et al. (2017) to compute the optimal interpolating function.

2.1 Sample Complexity and Empirical Risk Minimization

The constant will be chosen via sample complexity arguments. To this end, we derive uniform risk bounds for classes of continuous functions , where denotes the unit Euclidean ball in . The function classes of interest are defined in terms of -norm balls as

 F˜M={f∣∥f∥C1,1(Bd)≤˜M},

where we are using the norm

 ∥f∥C1,1(Bd):=max{supx∈Bd|f|,supx∈Bd|∇f|,Lip(∇f)}. (12)

We note that in order to derive the uniform risk bounds for the function classes , we require such classes be compact, which necessitates the choice of the norm in equation (12) as opposed to the seminorm.

With some abuse of notation, in this section we let denote the underlying function from which we observe noisy samples. We observe an i.i.d. sample

 S={(x1,y1),…,(xn,yn)}

drawn from a probability distribution supported on under the assumption

 SY|X ∼N(f∗(X),σ2),where ∥f∗∥C1,1(Bd)=M∗.

Since we are in the regression setting, we use squared error loss

 L(f(x),y)=(f(x)−y)2.

The true risk is defined as the expectation of over :

 R(f)=EP[L(f(x),y)],

and the empirical risk is the expectation over , the empirical distribution on the sample :

 ˆR(f)=ESS[L(f(x),y)].

In order for the empirical risk minimization procedure to converge to a minimizer of the true risk, we need to bound

 supf∈F˜M∣∣ˆR(f)−R(f)∣∣

with high probability. The most natural way to do so is by expanding the risk and appealing to entropy methods (i.e., covering number bounds) and standard concentration results. Recall that the covering number is the minimum number of norm balls of radius needed to cover a function class . We briefly discuss how this is useful toward deriving uniform bounds.

Given a class of bounded functions , an i.i.d. sample drawn from a probability distribution supported on , and a vector of i.i.d. Rademacher random variables , the following holds for :

 P⎡⎣supg∈G∣∣EQTg−EQg∣∣<2Rn(G)+√2log(2/δ)n⎤⎦>1−δ,

where is the empirical distribution on and

 Rn(G)=Eσ1n[supg∈G(n∑i=1σig(xi))]

is the Rademacher average conditional on the sample. Sridharan and Srebro (2010) show that

 Rn(G)≤infγ≥0⎧⎨⎩4γ+12∫supg∈G∥g∥∞γ√logN(η,G,∥⋅∥L2(QT))ndη⎫⎬⎭,

where the right-hand side is a modified version of Dudley’s entropy integral.

Since we are interested in bounding the risk, we use Lemma 5 to relate the Rademacher complexity of the loss class and the original class. We provide a proof based on three results due to Bartlett and Mendelson (2003); these require familiarity with McDiarmid’s inequality, stated below in Lemma 4.

Lemma 4 (McDiarmid’s inequality, McDiarmid, 1989).

Let be independent random variables that take values in a set . Suppose the function satisfies

 supx1,…,xn,x′i∈A∣∣f(x1,…,xn)−f(x1,…,xi−1,x′i,xi+1,…,xn)∣∣≤ci

for every . Then, for ,

 P[|f(X1,…,Xn)−Ef(X1,…,Xn)|≥t]≤2e−2t2/∑ni=1c2i.
Lemma 5.

Let be a class of functions with . Let be a bounded loss function with Lipschitz constant and . Then, the following is true for :

 P⎡⎣supf∈F˜M∣∣R(f)−ˆR(f)∣∣<4LLRn(F˜M)+7Lmax√log(8/δ)2n⎤⎦>1−δ.
Proof.

In this lemma, we begin by adapting Theorem 8 of Bartlett and Mendelson (2003) to find a bound on the risk that depends on a probabilistic term plus the expectation of the Rademacher average of the class of loss functions. We follow the proof of Lemma 4 of Bousquet et al. (2004) for guidance. We apply the two-sided form of McDiarmid’s inequality as we want bounds on the absolute value of , and appeal to Theorems 11 and 12 of Bartlett and Mendelson (2003) to relate the expected Rademacher average of the loss class to the empirical Rademacher average of .

Let be the class of functions consisting of . If , then . For any , the triangle inequality shows that

 ∣∣R(f)−ˆR(f)∣∣≤suph∈˜L∘F˜M∣∣Eh−ˆEnh∣∣+∣∣EL(0,y)−ˆEnL(0,y)∣∣.

McDiarmid’s inequality yields more favorable expressions for both terms on the right-hand side as follows. The most that can change by altering one sample is . Since , we have, with probability ,

 ∣∣EL(0,y)−ˆEnL(0,y)∣∣≤√L2maxlog8/δ2n.

The most that can change with an alteration of one sample is . Therefore, with probability ,

 ∣∣ ∣∣suph∈˜L∘F˜M∣∣Eh−ˆEnh∣∣−Esuph∈˜L∘F˜M∣∣Eh−ˆEnh∣∣∣∣ ∣∣≤√4L2maxlog8/δ2n.

Now,

 Esuph∈˜L∘F˜M∣∣Eh−ˆEnh∣∣≤max⎧⎨⎩Esuph∈˜L∘F˜M(Eh−ˆEnh),Esuph∈˜L∘F˜M(ˆEnh−Eh)⎫⎬⎭.

Let be a sample with the same distribution as . Conditioning on the original sample,

 Esuph∈˜L∘F˜M(Eh−ˆEnh) =Esuph∈˜L∘F˜ME[1nn∑i=1h(x′i,y′i)−ˆEnh∣∣∣S] ≤Esuph∈˜L∘F˜M(1nn∑i=1h(x′i,y′i)−ˆEnh) =Esuph∈˜L∘F˜M1nn∑i=1σi(h(x′i,y′i)−h(xi,yi)) ≤Esuph∈˜L∘F˜M1nn∑i=1σih(x′i,y′i)+Esuph∈˜L∘F˜M1nn∑i=1−σih(xi,yi) =2ERn(˜L∘F˜M).

The second line follows by applying Jensen’s inequality to , which is convex. Note the preceding argument is symmetric in and . Therefore, has the same upper bound and, with probability ,

 ∣∣R(f)−ˆR(f)∣∣≤2ERn(˜L∘F˜M)+3Lmax√log(8/δ)2n.

Theorem 11 of Bartlett and Mendelson (2003) uses McDiarmid’s inequality to bound the difference between the empirical and expected Rademacher averages, but assumes that we are interested in the Rademacher complexity of a class of functions mapping to . Since maps to , we rederive the analogous result here. The most that one sample affects is . We have

 P[∣∣Rn(˜L∘F˜M)−ERn(˜L∘F˜M)∣∣≥t]≤2e−2nt2/(4L2max).

Thus, with probability ,

 2ERn(˜L∘F˜M) ≤2Rn(˜L∘F˜M)+4Lmax√log(4/δ)2n ≤4LLRn(F˜M)+4Lmax√log(8/δ)2n.

The second line follows from part 4 of Theorem 12, which states that, for with Lipschitz constant and satisfying , . The reasoning from the proof also applies to the empirical Rademacher average, giving . Since has the same Lipschitz constant as , we use the notation .

Finally, with probability at least ,

 ∣∣R(f)−ˆR(f)∣∣≤4LLRn(F˜M)+7Lmax√log(8/δ)2n.

Next, the following lemma gives an upper bound on the covering number of with respect to the supremum norm.

Lemma 6 (adapted from Theorem 2.7.1 of Van Der Vaart and Wellner, 1996).

There exists a constant depending only on such that, for every ,

 logN(η,F˜M,∥⋅∥∞)≤K(˜Mη)d/2.
Proof.

Following Van Der Vaart and Wellner (1996), every is continuous on the open unit ball by assumption, so Taylor’s theorem applies everywhere. Fix , and take the -net of points in , where the number of points is less than or equal to the volume of times a constant that only depends on . Then for all vectors whose sum of entries do not exceed (this includes the zero vector and standard basis vectors), define for each the vector

The vector thus consists of the values discretized on a mesh with grid width .

Now, if are such that for all , then for a constant , implying that for each there exists an such that . The remainder term in the Taylor expansion

 (f−g)(x)=∑k≤1Dk(f−g)(xi)(x−xi)kk!+R

is bounded by the mesh width : indeed, we may consider an integral form via the Fundamental Theorem of Calculus:

 (f−g)(x) =(f−g)(xi)+∫xxiD(f−g)(s)ds =(f−g)(xi)+∫xxiD(f−g)(xi)+D(f−g)(s)−D(f−g)(xi)ds =(f−g)(xi)+D(f−g)(x−xi)+∫xxiD(f−g)(s)−D(f−g)(xi)ds ≤(f−g)(xi)+D(f−g)(x−xi)+C∥x−xi∥2

where the last line follows from . Therefore we see that the remainder term , and we may next substitute the mesh width bounding this quantity:

 |f−g|(x)∝∑¯¯¯k≤1δ2−¯¯¯kd∏i=1δ¯¯¯kiki!+δ2≤δ2(ed+1).

Thus, there exists such that the covering number is bounded by the number of different matrices whose rows are the vectors for such that and ranges over . There are such vectors. Now, by definition of and using for all the number of values of each element in each row is at most . Thus, each column of has at most values. Note that this already suffices to produce a finite bound. Following Van Der Vaart and Wellner (1996) and applying Taylor’s theorem again yields a less crude bound where is a constant depending only on . We may replace in this expression by and by its upper bound , where represents the -dimensional volume. Now, the lemma follows by taking logarithms, bounding by , and combining all constant terms into . ∎

We are now ready to provide risk bounds in the following theorem.

Theorem 7.

Suppose we set , where , and let be the class of functions with -norm bounded above by .

1. For ,

 P[supf∈F˜M∣∣R(f)−ˆR(f)∣∣<ε]>(1−δ)(1−e−n2/max{d,5}/2σ2),

where is a monotonically-decreasing function of for large enough and .

2.  supf∈F˜M∣∣R(f)−ˆR(f)∣∣a.s.−−→0.
Proof.

is a sequence of function classes with increasing -norm. We set the rate , where , so that is a candidate for large enough . We aim to use Lemma 5 to prove the desired probability statement, but our loss function is unbounded since can be arbitrarily large. To circumvent this, we also let the maximum value of increase with ; samples violating this condition are part of the error probability. Write , where . We condition on the event . Theorem 7.1 of Ledoux (2005) gives the following bound for suprema of Gaussian processes:

 P[max1≤i≤n|ξi|1−e−r2/2σ2.

The following is well-known (Boucheron, Lugosi, and Massart, 2013):

 Emax1≤i≤n|ξi|≤σ√2log2n.

Thus, .

Since the loss function is bounded after conditioning on , we can compute:

 Lmax

We also find the Lipschitz constant as follows, where :

 supx,y,f1,f2∣∣(f1(x)−y)2−(f2(x)−y)2∣∣ =supx,y,f1,f2|(−2y+f1(x)+f2(x))(f1(x)−f2(x))| ≤supx,y,f1,f2|−2y+f1(x)+f2(x)|∥f1−f2∥∞.

This implies:

 LL ≤supx,f1,f2|f1(x)+f2(x)|+2supy|y| <2(