1 Introduction
###### Abstract

Given a weighted graph with vertices, consider a real-valued regression problem in a semi-supervised setting, where one observes labeled vertices, and the task is to label the remaining ones. We present a theoretical study of -based Laplacian regularization under a -dimensional geometric random graph model. We provide a variational characterization of the performance of this regularized learner as grows to infinity while stays constant; the associated optimality conditions lead to a partial differential equation that must be satisfied by the associated function estimate . From this formulation we derive several predictions on the limiting behavior the -dimensional function , including (a) a phase transition in its smoothness at the threshold ; and (b) a tradeoff between smoothness and sensitivity to the underlying unlabeled data distribution . Thus, over the range , the function estimate is degenerate and “spiky,” whereas for , the function estimate is smooth. We show that the effect of the underlying density vanishes monotonically with , such that in the limit , corresponding to the so-called Absolutely Minimal Lipschitz Extension, the estimate is independent of the distribution . Under the assumption of semi-supervised smoothness, ignoring can lead to poor statistical performance; in particular, we construct a specific example for to demonstrate that has lower risk than due to the former penalty adapting to and the latter ignoring it. We also provide simulations that verify the accuracy of our predictions for finite sample sizes. Together, these properties show that is an optimal choice, yielding a function estimate that is both smooth and non-degenerate, while remaining maximally sensitive to .

Asymptotic behavior of -based Laplacian regularization in semi-supervised learning

 Ahmed El Alaoui⋆ Xiang Cheng⋆ Aaditya Ramdas⋆,†
 Martin J. Wainwright⋆,† Michael I. Jordan⋆,†

 Department of Electrical Engineering and Computer Sciences⋆, and Department of Statistics†, UC Berkeley, Berkeley, CA 94720.

Keywords: -based Laplacian regularization; semi-supervised learning; asymptotic behavior; geometric random graph model; absolutely minimal Lipschitz extension; phase transition.

## 1 Introduction

Semi-supervised learning is a research field of growing interest in machine learning. It is attractive due to the availability of large amounts of unlabeled data, and the growing desire to exploit it in order to improve the quality of predictions and inference in downstream applications. Although many proposed methods have been successful empirically, a formal understanding of the pros and cons of different semi-supervised methods is still incomplete.

The goal of this paper is to study the tradeoffs between some recently proposed Laplacian regularization algorithms for graph-based semi-supervised learning. In the noiseless setting, the problem amounts to a particular form of interpolation of a graph-based function. More precisely, consider a graph where is a set of vertices, and is the set of edges equipped with a set of non-negative edge weights. For some subset of the vertex set, say with cardinality , and an unknown function , suppose that we are given observations of the function at the specified subset of vertices. Our goal is to use the observed values to make predictions of the function values at the remaining vertices in a way that agrees with as much as possible.

In order to render this problem well-posed, the behavior of the function must be tied to the properties of the graph . In a statistical context, one such requirement is such that the marginal distribution of the points be related to the regression function —for instance, by requiring that be smooth on regions of high density. This assumption and variants thereof are collectively referred to as the cluster assumption in the semi-supervised learning literature.

Under such a graph-based smoothness assumption, one reasonable method for extrapolation is to penalize the change of the function value between neighboring vertices while agreeing with the observations. A widely used approach involves using the -based Laplacian as a regularizer; doing so leads to the objective

 minf∑ij∈Ewij(f(vi)−f(vj))2subject to f(vi)=yi for all i∈O, (1)

where the penalization is enforced by the quadratic form given by the graph Laplacian [ZGL03]. This method is closely tied to heat diffusion on the graph and has a probabilistic interpretation in terms of a random walk on the graph. Unfortunately, solutions of this objective are badly behaved in the sense that they tend to be constant everywhere except for the points associated with observations [NSZ09]. The solution must then have sharp variations near those points in order to respect the measurement constraints.

Given this undesirable property, several alternative methods have been proposed in recent work (e.g., [AL11, BZ13, ZB11, KRSS15]). One such formulation is based on interpolating the observed points exactly while penalizing the maximal gradient value on neighboring vertices:

 minfmaxij∈Ewij∣∣f(vi)−f(vj)∣∣subject to f(vi)=yi for all i∈O. (2)

Any solution to this variational problem is known as an inf-minimizer. [KRSS15] recently proposed a fast algorithm for solving the optimization prolblem (2). In fact, their algorithm finds a specific solution that not only minimizes the maximum gradient value, but also the second largest one among all minimizers of the former and so on—that is to say, they find a solution such that the gradient vector is minimal in the lexicographic ordering, which they refer to as the lex-minimizer. They observed empirically that when grows to infinity while both the degree of the graph and number of observations are held fixed, the lex-minimizer is a better behaved solution than its -Laplacian counterpart. More precisely, the observed advantage is twofold: (a) the solution is a better interpolation of the observed values; and (b) the average -error for the lex-minimizer remains stable, while it quickly diverges with when is the -Laplacian minimizer. Their experiments together with the known limitations of the -Laplacian regularization method point to the possible superiority of -based formulation (2) over the -based (1) in a semi-supervised setting. However, we currently lack a theoretical understanding of this assertion. Accordingly, we aim to fill this gap in the theoretical understanding of Laplacian-based regularization by studying both formulations in the asymptotic limit as the graph size goes to infinity.

We conduct our investigation in the context of a more general objective that encompasses the approaches (1) and (2) as special cases. In particular, for a positive integer , we consider the variational problem

 Jp(f)=∑ij∈Ewpij∣∣f(vi)−f(vj)∣∣p. (3)

The objective is referred to as the (discrete) -Laplacian of the graph in the literature; for instance, see the papers by [ZS05] and [BH09], as well as references therein. It offers a way to interpolate between the 2-Laplacian regularization method and the inf-minimization approach. It is then natural to consider the general family of interpolation problems based on -Laplacian regularization—namely

 minf Jp(f)subject to f(vi)=yi for i∈O. (4)

Formulations (1) and (2) are recovered respectively with and . Indeed, in the latter case, observe that . Moreover, under certain regularity assumptions ([EH90, KRSS15]), it follows that the lex-minimizer is the limit of the (unique) minimizers of as grows to infinity111The construction of this sequence of minimizers is known as the Pólya algorithm, and the study of its rate of convergence is a classical problem in approximation theory ([DLT83, ET87, LT89, EH90]).—that is, we have the equivalence

 flex=limp→∞ argminu  Jp(u)subject to u(vi)=yi for all i∈O. (5)

#### Our contributions:

We analyze the behavior of -Laplacian interpolation when the underlying graph is drawn from a geometric random model. Our first main result is to derive a variational problem that is the almost-sure limit of the formulation (4) in the asymptotic regime when the size of the graph grows to infinity. For a twice differentiable function , we use and to denote its gradient and Hessian, respectively. Letting denote the density of the vertices in a latent space, we show that for any even integer , solutions of this variational problem must satisfy the partial differential equation

 Δ2f(x)+2⟨∇logμ(x),∇f(x)⟩+(p−2)Δ∞f(x)=0,

where is the usual -Laplacian operator, while is the -Laplacian operator, which is defined to be zero when .

This theory then yields several predictions on the behavior of these regularization methods when the number of labeled examples is fixed while the number of unlabeled examples becomes infinite: the method leads to degenerate solutions when ; i.e., they are discontinuous, a manifestation of the curse of dimensionality. On the other hand, the solution is continuous when . The solution is dependent on the underlying distribution of the data for all finite values of ; however, when , the solution is not dependent on the underlying density . Consequently, as the graph size increases, the lex- and inf-minimizers end up interpolating the observed values without exploiting the additional knowledge of the density of the features that is provided by the abundance of unlabeled data.

In order to illustrate the consequences of this last property, we study a simple one-dimensional regression problem whose intrinsic difficulty is controlled by a parameter . We show that the 2-Laplacian method has an estimation rate independent of while the infinity-minimization approach has a rate that is increasing in . As shown by our analysis, this important difference can be traced back to whether or not the method leverages the knowledge of . We also provide an array of experiments that illustrate some pros and cons of each method, and show that our theory predicts these behaviors accurately. Overall, our theory lends support to using intermediate value of that will lead to non-degenerate solutions while remaining sensitive to the underlying data distribution.

## 2 Generative Model

We follow the popular assumption in the semi-supervised learning literature that the graph represents the metric properties of a cloud point in -dimensional Euclidean space ([ZGL03, BCH03, BN04, Hei06a, NSZ09, ZB11]). More precisely, suppose that we are given a probability distribution on the unit hypercube having a smooth density with respect to the Lebesgue measure, as well as a bounded decreasing function such that . We then draw an i.i.d. sequence of samples from ; these vectors will be identified with the vertices of the graph : . Finally, we associate to each pair of vertices the edge weight , where is a bandwidth parameter. We use to denote the random graph generated in this way.

#### Degree asymptotics

Given a sequence of graphs generated as above, we study the behavior of the minimizers of in the limit . A first step is to understand the behavior of the graph itself, and in particular its degree distribution in this limit. In order to gain intuition for this issue. consider the special case when . If the bandwidth parameter is held fixed, then any sequence of points that fall in a ball of radius will form a clique. Thus, the graph will contain roughly cliques, each with approximately vertices. It is typically desired that the sequence of graphs be sparse with an appropriate degree growth (e.g., constant or logarithmic growth) so that it converges to the underlying manifold. In order to enforce this behavior, the bandwidth parameter should tend to zero as the sample size increases.

Under this assumption, it can be shown that the scaled degree at any vertex , given by

 d(x)=1NhdN∑i=1φ(∥xi−x∥2h),

concentrates around . A precise statement can be found in [Hei06a]; roughly speaking, it follows from the fact that for any fixed point in , as goes to zero and under a smoothness assumption on the density , the probability that a random vector falls in the -neighborhood of scales as .

## 3 Variational problem and related PDE

In this section, our main goal is to study the behavior of the solution in the limit as the sample size and the bandwidth . As discussed above, it is natural to consider scalings under which decreases in parallel with the increase in . However, for simplicity, we follow [NSZ09] and first take the sample size to infinity with the bandwidth held constant, and then let go to zero.222In the case , it is known that the same limiting objects are recovered when a joint limit in and is taken with but ([Hei06a]). Based on the discussion above, this scaling implies a super-logarithmic degree sequence. It is still to be verified if the same result holds for all . Our first result characterizes the asymptotic behavior of the objective .

###### Theorem 3.1.

Let be continuously differentiable with a bounded derivative, and let be a bounded density. Then for any even integer , we have

 Ip(f):=limh→0 limN→∞ 1N2 hp+d Jp(f)=Cp∫∥∇f(x)∥p2μ2(x)dx, (6)

where .

We provide the proof in Appendix A; it involves applying the strong law for -statistics, as well as an auxiliary result on the isotropic nature of the integral of a rank-one tensor.

Based on Theorem 3.1, the asymptotic limit of the semi-supervised learning problem is a supervised non-parametric estimation problem with a regularization term given by the functional —namely, the problem

 infg∫∥∇g(x)∥p2μ2(x)dxsubject to g(xi)=yi for all i∈O. (7)

Our next main result characterizes the solutions of this optimization problem in terms of a partial differential equation known as the (weighted) -Laplacian equation. Here the word “weighted” refers to the term in the functional (7) ([HKM12, Obe13]).

Let us introduce various pieces of notation that are useful in the sequel. Given a vector field , we use to denote denote its divergence. For a scalar-valued function , we let

 Δ2f=div(∇f)=d∑i=1∂2xif,andΔ∞f=⟨∇f, ∇2f ∇f⟩⟨∇f,∇f⟩=1∥∇f∥22d∑i,j=1∂xif⋅∂xi,xjf⋅∂xjf

denote the (standard) -Laplacian operator and the -Laplacian operator, respectively.

###### Theorem 3.2.

Suppose that the density is bounded and continuously differentiable. Then any twice-differentiable minimizer of the functional (7) must satisfy the Euler-Lagrange equation

 div(μ2(x)∥∇f(x)∥p−22∇f(x))=0. (8a) If moreover the distribution μ has full support, then equation (8a) is equivalent to Δ2f(x)+2⟨∇logμ(x),∇f(x)⟩+(p−2)Δ∞f(x)=0. (8b)

The proof employs standard tools from calculus of variations [GF63]. We note here that does not need to be twice differentiable for the above result to hold ([HKM12]) in which case equations (8a) and (8b) have to be understood in the viscosity sense ([CEG01, AS10]). Twice differentiability is assumed so only for ease of the proof (see Appendix B).

When is the uniform distribution, equation (8b) reduces to the partial differential equation (PDE)

 Δ2f(x)+(p−2)Δ∞f(x)=0,

which is known as the -Laplacian equation and often studied in the PDE literature ([HKM12, Obe13]). If one divides by and lets , one obtains the infinity-Laplacian equation

 Δ∞f(x)=0, (9)

subject to measurement constraints for . This problem has been studied by various authors (e.g., [CEG01, ACJ04, PSSW09, AS10]). Note that in dimension , we have , and equation (8b) reduces to

 (p−1)μ(x)f′′(x)+2μ′(x)f′(x)=0.

Therefore, if we specialize to the case , the 2-Laplacian regularization method solves the differential equation , whereas if we specialize to , then the inf-minimization method solves the differential equation . Note that the two equations coincide only when is the uniform distribution, in which case is uniformly zero.

## 4 Insights and predictions

Our theory from the previous section allows us to make a number of predictions about the behavior of different regularization methods, which we explore in this section.

### 4.1 Inf-minimization is insensitive to μ

Observe that the effect of the data-generating distribution has disappeared in equation (9). One could also see this by taking the limit in the objective to be minimized in Theorem 3.1—in particular, assuming that has full support, we have .

From the observations above, one can see that in the limit of infinite unlabeled data—i.e., once the distribution is available—the 2-Laplacian regularization method, as well as any -Laplacian method based on finite (even) , incorporates knowledge of in computing the solution; in contrast, for , the inf-minimization method does not (see Figure 2). On the other hand, it has been shown that the -Laplacian method is badly behaved for in the sense that the solution tends to be uninformative (constant) everywhere except on the points of observation. The solution must then have sharp variations on those points in order to respect the measurement constraints (see Figure 1 for an illustration of this phenomenon). We show in the next section that this problem plagues the -Laplacian minimization approach as well whenever .

### 4.2 p-Laplacian regularization is degenerate for p≤d

In this section, we show that -Laplacian regularization is degenerate for all . This issue was originally addressed by [NSZ09], who provided an example that demonstrates the degeneracy for for . Here we show that the underlying idea generalizes to all pairs with . Recall that Theorem 3.1 guarantees that

 Ip(f) :=limh→0 limN→∞ 1N2 hp+d Jp(f)=Cp∫∥∇f(x)∥p2μ2(x)dx.

In the remainder of our analysis, we treat the cases and separately.

#### Case p≤d−1:

Beginning with the case , we first set and then let be any point on the unit sphere (i.e., ). Define the function for some , and let the observed values be for . Using the fact that and assuming that is uniformly upper bounded by on , we have

 Ip(fϵ)=∫B(0,ϵ)μ2(x)ϵpdx≤μ2maxϵpvol(B(0,ϵ))=μ2maxvol(B(0,1))ϵd−p,

where denotes the Euclidean ball of radius centered at the origin, and denotes the Lebesgue volume in . Consequently, we have , so the infimum of is achieved for the trivial function that is everywhere except at the origin, where it takes the value . The key issue here is that grows at a rate of while the measure of the “spike” in the gradient shrinks at a rate of .

#### Case p=d:

On the other hand, when , then we take , for which we also have and . With this choice, we have

 Ip(fϵ) =1log(1+ϵϵ)d∫B(0,1)∥x∥d2(∥x∥22+ϵ)dμ2(x)dx ≤μ2maxlog(1+ϵϵ)d∫B(0,1)∥x∥d2(∥x∥22+ϵ)ddx \lx@stackrel(ii)=d μ2maxvol(B(0,1))log(1+ϵϵ)d∫10ud−1(u+ϵ)ddu \lx@stackrel(iii)≤d μ2maxvol(B(0,1))2log(1+ϵϵ)d−1,

where step (i) follows from a change of variables from to the radial coordinate ; step (ii) follows by the variable change ; and step (iii) follows by upper-bounding by in the numerator inside the integral. Again, we find that . Thus, in order to avoid degeneracies, it is necessary that .

It is worth noting that [AL11] studied the problem of computing the so-called -resistances of a graph, which are a family of distances on the graph having a formulation similar —in fact, dual— to the -Laplacian regularization method considered in the present paper, and where . They established a phase transition in the value of for the geometric random graph model, where above the threshold , the -resistances “[…] depend on trivial local quantities and do not convey any useful information […]” about the graph, while below the threshold , these resistances encode interesting properties of the graph. They conclude by suggesting the use of -Laplacian regularization with . The latter condition can be read . However, as shown by the examples above, this choice is still problematic, and in fact, the choice is the smallest admissible value for .

We also note that the example for extends to an arbitrary number of labeled points: one simply has a spike for each point. Undesirable behavior arises as long as the set of observed points is of measure zero. Finally, we note that both the above example can be adapted to the case where the squared loss is optimized along with the regularizer instead of imposing the hard constraint (see Appendix D). The issue is that the regularizer is too weak and allows to choose the solution from a very large class of functions.

### 4.3 p-Laplacian solution is smooth for p≥d+1

At this point, a natural question is whether the condition is also sufficient to ensure that the solution is well-behaved. In a specific setting to be described here, the answer is yes333Interestingly enough, the -Laplacian equation has been extensively studied in non-linear potential theory. It is in fact the prototypical example of a non-linear degenerate elliptic equation. The regularity of the solutions is well understood for any real number (see e.g. [HKM12]). For our purposes however, we do not need the full power of this theory.. The underlying reason is the Sobolev embedding theorem. More precisely, let denote the weighted Sobolev space of all (weakly) differentiable functions on such that the semi-norm

 ∥f∥1,p:=(∫∥∇f(x)∥p2μ2(x)dx)1/p<∞.

If, moreover, we assume is strictly positive almost everywhere and restrict the above class to functions vanishing on the boundary, then actually defines a norm. When , and under additional regularity conditions on (e.g. upper- and lower-bounded by constants a.e.), the space can be embedded continuously into the space of Hölder functions of exponent , i.e. functions such that for all for some dimension-dependent constant . For details, see Theorem 11.34 of [Leo09] or Lemma 5.17 of [AF03].  [BO92] provide some relaxed conditions on . Since the minimizer of is such that , the function is in the Sobolev class , and therefore it automatically inherits the Hölder smoothness property, i.e. the -Laplacian solution is smooth for , asymptotically as 444If the graph is finite, then the solution might still contain small spikes as apparent in Figures 1 and 2.. Incidentally, via the examples in the previous section, it is clear that no such embedding exists if .

### 4.4 An example where inf-minimization interpolates well

By extension to the case , the infinity-Laplacian solutions also enjoy continuity (this solution is actually Lipschitz based on its interpretation as the absolutely minimal Lipschitz extension of the observations ([ACJ04]). It was also argued [KRSS15], based on experimental results, that the inf-minimization method has a better behavior in higher dimensions in terms of faithfulness to the observations. We illustrate this point by considering a simple example, similar to the one above, for which the -Laplacian equation (9) produces a sensible solution. With and denoting the Euclidean unit sphere, suppose that we limit ourselves to functions satisfying the observation constraints and for all .

Without any further information on the data-generating process, a reasonable fit is the function . We claim that it is the only radially symmetric solution to the differential equation with the boundary constraints and for all . In order to verify this claim, let for a function . For any non-zero , we have

 ∇f(x)=g′(∥x∥2)x∥x∥2,% and∇2f(x)=g′′(∥x∥2)xx⊺∥x∥22+g′(∥x∥2)I∥x∥2−g′(∥x∥2)xx⊺∥x∥32.

Then

 Δ∞f(x)=1∥∇f(x)∥22∇f(x)⊺∇2f(x)∇f(x)=g′′(∥x∥2).

Given the boundary conditions on , the only solution to = 0, is given by , meaning that . On the other hand, the latter is not a solution to , unless .

In summary, this section reveals a trade-off between smoothness and sensitivity to the data-generating density in -Laplacian regularization: the solution is strongly sensitive to but is non-smooth for small values of , while it is smooth but weakly dependent on for large and infinite values of . The transition from degeneracy to smoothness happens at a sharp threshold , while the dependence on weakens with larger and larger without a threshold.

While the property of smoothness is an obvious quality in an estimation setting, and may lead to improved statistical rates if assumed first hand —especially if the signal one wishes to recover is itself smooth— it is less obvious how to quantify the advantages entailed by the sensitivity to the underlying data-generating density; especially when the latter is available and is to be incorporated in the design of an estimator. We provide in the next section a simple, one-dimensional regression example where the regression function is tied to the density via the cluster assumption, and where a difference in estimation rates between the and methods is exhibited. This difference is explicitly due to the fact that the method leverages the knowledge of while does not.

## 5 The price of “forgetting” μ

We consider in this section a simple estimation example in one dimension where the asymptotic formulation of 2-Laplacian regularization method achieves a better rate of convergence than that of the inf-minimization method. Such an advantage of the method over the method should be conceivable under the cluster assumption: the regularizer will encodes information about the target function via while the regularizer does not.

Let the target function and the data-generating density be supported on the interval . For some small , we construct a density that takes a uniformly small value over the interval , and takes and a large value on the complementary set . We also let have a high Lipschitz constant on the interval and be constant otherwise. More precisely, the density and function are constructed as follows:

 μ(x)={bx∈[−ϵ,ϵ],ax∈[−1,−ϵ)∪(ϵ,1],  f∗(x)=⎧⎨⎩−1x∈[−1,−ϵ),x/ϵx∈[−ϵ,ϵ],1x∈(ϵ,1]. (10)

The constants and are related by the equation so that the density integrates to 1, and we think of as being much smaller than , i.e. . Consider the following two classes of functions corresponding to the regularizer for and respectively:

 H L :={f:[−1,1]→R , f % absolutely continuous, odd and sup|x|≤1|f′(x)|<∞}.

We define the associated norms and on and respectively. Observe that while is upper bounded by a constant:

 ∫1−1[(f∗)′(x)]2μ2(x)dx=∫ϵ−ϵ1ϵ2b2dx=2b2/ϵ.

Taking , the above integral is bounded above by .

We draw points independently from and observe the responses where are i.i.d. standard normal random variables, . We compare the following two M-estimators:

 ˆfH=argminf 1nn∑i=1(yi−f(xi))2s.t.\  f∈H , ∥f∥H≤2, and ˆfL=argminf 1nn∑i=1(yi−f(xi))2s.t.\  f∈L , ∥f∥L≤1/ϵ,

in terms of the rate of decay of the error , where . Note that this error can in principle tend to zero as grows to infinity since the target belongs to the hypothesis class in both of the considered cases, i.e. there is no approximation error.

###### Theorem 5.1.

There are universal constants such that for any , the estimator satisfies the bound

 ∥∥^fH−f∗∥∥2n ≤c1(σ2n)2/3 (11)

with probability at least . On the other hand, the estimator satisfies the bound

 ∥∥ˆfL−f∗∥∥2n ≤c3(σ2ϵ n)2/3 (12)

with probability at least .

We can thus compare upper bounds on the rate of estimation of the and methods respectively. The upper bound (12) shows a dependence in , while the bound (11) shows no dependence on . One might ask if these bounds are tight. In particular, a question of interest is whether the estimator adapts to the target without the need to know the density , in which case the corresponding estimator would achieve a better rate. While we think our bounds could be sharpened, we strongly suspect that the estimator cannot achieve a rate independent of (in contrast to the estimator). We provide an array of simulations showing that the rate of deteriorates as gets small, where the rate of the method stays the same (see Figure 3).

### 5.1 Main ideas of the proof

The first non-asymptotic bound (12) in Theorem 5.1 follows in a straightforward way from known results on the minimax rate of estimation on the class of Lipschitz functions. Indeed, the rate of estimation on this class with Lipschitz constant is , and in our case . On the other hand, the second result (11) follows by recognizing that the class is a weighted Sobolev space of order 1, which is a Reproducing Kernel Hilbert Space (RKHS). The associated kernel, as identified by [NSZ09], is given by

 K(x,y)=14∫1−1dt/μ2(t)−12∣∣∣∫yxdt/μ2(t)∣∣∣,for all x,y∈[−1,1]. (13)

It is known that the rate of estimation on a ball of radius of an RKHS is tightly related to the decay of the eigenvalues of the kernel. More precisely, the rate of estimation is upper-bounded with high probability by the smallest solution to the inequality

 (2n∞∑j=0min{γj,δ2})1/2≤Rσ δ2, (14)

where is the sequence of eigenvalues of the kernel  ([Kol06, Men02, BBM02, vdG00]). Our next result upper-bounds the rate of decay of these eigenvalues.

###### Lemma 5.1.

For any , the eigenvalues of the kernel form a doubly indexed sequence with , . This sequence satisfies the upper bound

 γk,j ≤⎧⎨⎩1.26if k=j=01(k2√2+jϵ−3/4)2π2% otherwise.

Plugging these estimates in equation (14) leads to the rates we claim in Theorem 5.1. The full details are given in Appendix C.

## 6 Related work

The discrete graph -Laplacian was introduced by [ZS05] as a form of regularization generalizing classical Laplacian regularization in semi-supervised learning [ZGL03]. We mention however that the continuous -Lapalcian has been extensively studied much earlier in PDE theory [HKM12, ACJ04]. It was also used and analyzed for spectral clustering, where it provides a family of relaxations to the normalized cut problem [Amg03, BH09, LHDN10]. A dual version of the problem (the -resistances or -volatges problem) was investigated and shown to yield improved classification performance in [BZ13]. [AL11] prove the existence of a phase transition under the geometric graph model roughly similar to the one exhibited in this paper, although the thresholds are slightly different. The exact nature of the connection is still unclear however. On the other hand, a game-theoretic interpretation of the -Laplacian solution is studied in [PS08, PSSW09], and a similar transition at in the behavior of the game is found. The assumption that the graph entails a geometric structure is popular in the analysis of semi-supervised learning algorithms [BN01, BCH03, BN04, Hei06a, Hei06b, NSZ09]. This line of work have mostly focused on the -Laplacian formulation and its convergence properties to a differential operator on the limiting manifold.

Among other approaches that circumvent the degeneracy issue discussed in the paper, we mention higher order regularization [ZB11], where instead of only penalizing the first derivative of the function, one can penalize up to derivatives. This approach considers solutions in a higher order Sobolev space which, via the Sobolev embedding theorem, only contains smooth functions if (see [AF03, Leo09]). This approach can be implemented algorithmically using the discrete iterated Laplacian [ZB11, WSST15].

Results on statistical rates for semi-supervised learning problems are very sparse. The first results are covered in [CC96] in the context of mixture models, [Rig07] in the context of classification, and [LW07] for regression. A recent line of work considers the setting where the graph is fixed while the set of vertices where the labels are available is random [AZ07, JZ07, JZ08, SB14, SCS15]. The methods studied in this setting generalize the 2-Laplacian method by penalizing by the quadratic form given by a general positive semidefinite kernel. The derived rates depend on the structural properties of the graph and/or the kernel used. It is shown in particular that using the normalized Laplacian instead of the regular one leads to a better statistical bound, and on the other hand, one can obtain rates depending on the Lovász Theta function of the graph by choosing the regularization kernel optimally.

## 7 Conclusion and open problems

In this paper, we used techniques and ideas from PDE theory to yield insight into the behavior of various methods for semi-supervised learning on graphs. The -dimensional geometric random graph model analyzed in this paper is a common one in the literature, and the most common Laplacian penalization technique is for , though the choice has also attracted some recent attention. Our paper sheds light on both of these options, as well as the full range of in between. From our asymptotic analysis, we see that for a -dimensional problem, degenerate solutions occur whenever , whereas at the other extreme, the choice leads to solutions that are totally insensitive to the input data distribution. Hence, the choice seems like a prudent one, trading off degeneracy of the solution with sensitivity to the unlabeled data.

An important companion problem is the unconstrained version of the problem, in which we penalize a (weighted) sum of two terms, the -Laplacian term and a sum of squared losses on the labeled data. One can see that under our asymptotics, we can make the same conclusions about the unconstrained solution. Hence, the conclusions of this paper do not hinge on the fact that we modeled the problem with equality constraints, and do apply more generally. For completeness, we outline this argument in Appendix D.

There are perhaps two important assumptions which must be relaxed in future work. The first is the asymptotics we consider, in which the number of labeled points is fixed as the unlabeled points become infinite. An interesting situation is when both labeled and unlabeled points grow at a relative rate. We showed that in the first situation, a certain class of methods, namely all -Laplacian methods with , behave poorly.

An interesting direction is to understand what set of methods are appropriate in different regimes of relative growth rate. In particular, we suspect that most of our results should continue to hold as long as the number of unlabeled points grow at a much faster rate than the number of labeled points. The second assumption is about the geometric random graph model, and how much our results are tied to this model. Finite sample rates and non-asymptotic arguments are necessary to understand how soon we can expect to see these effects on general graphs in practice. Finally, the model selection problem of what to use in practice is very important, since we may not know the underlying dimensionality of the data from which our graph was formed.

#### Acknowledgments:

We thank Jason Lee, Kevin Jamieson, Anup Rao, Sushant Sachdeva and Ryan Tibshirani for discussions in the early phases of this work. This work was partially supported by Air Force Office of Scientific Research grant FA9550-14-1-0016 and Office of Naval Research grant DOD-ONR-N00014 to MJW. This material is based upon work supported in part by the Office of Naval Research under grant number N00014-11-1-0688 and by the Army Research Laboratory and the U. S. Army Research Office under grant number W911NF-11-1-0391to MIJ.

## References

• [ACJ04] Gunnar Aronsson, Michael Crandall, and Petri Juutinen. A tour of the theory of absolutely minimizing functions. Bulletin of the American mathematical society, 41(4):439–505, 2004.
• [AF03] Robert A Adams and John JF Fournier. Sobolev spaces, volume 140. Academic press, 2003.
• [AL11] Morteza Alamgir and Ulrike V Luxburg. Phase transition in the family of -resistances. In Advances in Neural Information Processing Systems, pages 379–387, 2011.
• [Amg03] S Amghibech. Eigenvalues of the discrete -Laplacian for graphs. Ars Combinatoria, 67:283–302, 2003.
• [AS10] Scott N Armstrong and Charles K Smart. An easy proof of Jensen’s theorem on the uniqueness of infinity harmonic functions. Calculus of Variations and Partial Differential Equations, 37(3-4):381–384, 2010.
• [AZ07] Rie Kubota Ando and Tong Zhang. Learning on graph with Laplacian regularization. Advances in neural information processing systems, 19:25, 2007.
• [BBM02] Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson. Localized Rademacher complexities. In Computational Learning Theory, pages 44–58. Springer, 2002.
• [BCH03] Olivier Bousquet, Olivier Chapelle, and Matthias Hein. Measure based regularization. In Advances in Neural Information Processing Systems, pages 1221–1228, 2003.
• [BH09] Thomas Bühler and Matthias Hein. Spectral clustering based on the graph -Laplacian. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 81–88. ACM, 2009.
• [BN01] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, volume 14, pages 585–591, 2001.
• [BN04] Mikhail Belkin and Partha Niyogi. Semi-supervised learning on Riemannian manifolds. Machine learning, 56(1-3):209–239, 2004.
• [BO92] RC Brown and B Opic. Embeddings of weighted sobolev spaces into spaces of continuous functions. In Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, volume 439, pages 279–296. The Royal Society, 1992.
• [BZ13] Nick Bridle and Xiaojin Zhu. -voltages: Laplacian regularization for semi-supervised learning on high-dimensional data. In Eleventh Workshop on Mining and Learning with Graphs (MLG2013), 2013.
• [CC96] Vittori Castelli and Thomas M Cover. The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Transactions on Information Theory, 42(6):2102–2117, 1996.
• [CEG01] Michael G Crandall, Lawrence C Evans, and Ronald F Gariepy. Optimal Lipschitz extensions and the infinity Laplacian. Calculus of Variations and Partial Differential Equations, 13(2):123–139, 2001.
• [DLT83] RB Darst, DA Legg, and DW Townsend. The Pólya algorithm in approximation. Journal of Approximation Theory, 38(3):209–220, 1983.
• [Dud99] R. M. Dudley. Uniform central limit theorems. Cambridge university press, 1999.
• [EH90] Alan Egger and Robert Huotari. Rate of convergence of the discrete Pólya algorithm. Journal of approximation theory, 60(1):24–30, 1990.
• [ET87] AG Egger and GD Taylor. Dependence on of the best approximation operator. Journal of approximation theory, 49(3):274–282, 1987.
• [GF63] IM Gelfand and SV Fomin. Calculus of variations. revised english edition translated and edited by richard a. silverman, 1963.
• [Hei06a] Matthias Hein. Geometrical aspects of statistical learning theory. PhD thesis, TU Darmstadt, 2006.
• [Hei06b] Matthias Hein. Uniform convergence of adaptive graph-based regularization. In Learning Theory, pages 50–64. Springer, 2006.
• [HKM12] Juha Heinonen, Tero Kilpeläinen, and Olli Martio. Nonlinear potential theory of degenerate elliptic equations. Courier Corporation, 2012.
• [JZ07] Rie Johnson and Tong Zhang. On the effectiveness of Laplacian normalization for graph semi-supervised learning. Journal of Machine Learning Research, 8(4), 2007.
• [JZ08] Rie Johnson and Tong Zhang. Graph-based semi-supervised learning and spectral kernel design. Information Theory, IEEE Transactions on, 54(1):275–288, 2008.
• [Kol06] Vladimir Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656, 2006.
• [KRSS15] Rasmus Kyng, Anup Rao, Sushant Sachdeva, and Daniel A Spielman. Algorithms for Lipschitz learning on graphs. Proceedings of The 28th Conference on Learning Theory, pages 1190–1223, 2015.
• [Leo09] Giovanni Leoni. A first course in Sobolev spaces, volume 105. American Mathematical Society Providence, RI, 2009.
• [LHDN10] Dijun Luo, Heng Huang, Chris Ding, and Feiping Nie. On the eigenvectors of -Laplacian. Machine Learning, 81(1):37–51, 2010.
• [LT89] David A Legg and Douglas W Townsend. The Pólya algorithm for convex approximation. Journal of mathematical analysis and applications, 141(2):431–441, 1989.
• [LW07] John Lafferty and Larry Wasserman. Statistical analysis of semi-supervised regression. In Advances in Neural Information Processing Systems 20, pages 801–808. Curran Associates, Inc., 2007.
• [Men02] Shahar Mendelson. Geometric parameters of kernel machines. In Computational Learning Theory, pages 29–43. Springer, 2002.
• [NSZ09] Boaz Nadler, Nathan Srebro, and Xueyuan Zhou. Semi-supervised learning with the graph laplacian: The limit of infinite unlabelled data. In Neural Information Processing Systems, pages 1330–1338, 2009.
• [Obe13] Adam M Oberman. Finite difference methods for the infinity Laplace and -Laplace equations. Journal of Computational and Applied Mathematics, 254:65–80, 2013.
• [PS08] Yuval Peres and Scott Sheffield. Tug-of-war with noise: A game-theoretic view of the -laplacian. Duke Mathematical Journal, 145(1):91–120, 2008.
• [PSSW09] Yuval Peres, Oded Schramm, Scott Sheffield, and David Wilson. Tug-of-war and the infinity Laplacian. Journal of the American Mathematical Society, 22(1):167–210, 2009.
• [Rig07] Philippe Rigollet. Generalization error bounds in semi-supervised classification under the cluster assumption. Journal of Machine Learning Research, 8:2183–2206, 2007.
• [SB14] Rakesh Shivanna and Chiranjib Bhattacharyya. Learning on graphs using orthonormal representation is statistically consistent. In Advances in Neural Information Processing Systems, pages 3635–3643, 2014.
• [SCS15] Rakesh Shivanna, Bibaswan K Chatterjee, Raman Sankaran, Chiranjib Bhattacharyya, and Francis Bach. Spectral norm regularization of orthonormal representations for graph transduction. In Advances in Neural Information Processing Systems, pages 2206–2214, 2015.
• [Ser09] Robert J Serfling. Approximation theorems of mathematical statistics, volume 162. John Wiley & Sons, 2009.
• [vdG00] Sara van de Geer. Empirical processes in M-estimation. Cambridge University Press Cambridge, 2000.
• [WSST15] Yu-Xiang Wang, James Sharpnack, Alex Smola, and Ryan J Tibshirani. Trend filtering on graphs. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics,, page 1042–1050, 2015.
• [ZB11] Xueyuan Zhou and Mikhail Belkin. Semi-supervised learning by higher order regularization. In International Conference on Artificial Intelligence and Statistics, pages 892–900, 2011.
• [ZGL03] Xiaojin Zhu, Zoubin Ghahramani, and John Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of The 31st International Conference on Machine Learning, volume 3, pages 912–919, 2003.
• [ZS05] Dengyong Zhou and Bernhard Schölkopf. Regularization on discrete spaces. In Pattern Recognition, pages 361–368. Springer, 2005.

## Appendix A Proof of Theorem 3.1

Let and be i.i.d. draws from . Since is a bounded function, the first moment is finite. Therefore, by the strong law of large numbers for U-statistics ([Ser09]), we have

 limN→∞1N2Jp(f)=∫∫φ(∥x−x′∥2h)p∣∣f(x)−f(x′)∣∣pμ(x)μ(x′)dxdx′,almost surely.

Writing for some scalar and vector , the second integral simplifies to

 ∫φ(∥x−x′∥h)p∣∣f(x)−f(x′)∣∣pμ(x′)dx′=hd∫φ(∥z∥2)p|f(x)−f(x+hz)|pμ(x+hz)dz.

We now divide by and consider the behavior as the bandwidth parameter tends to zero. Since the functions , and are all bounded on a compact domain, the dominated convergence theorem implies that

 μ(x)∫φ(∥z∥2)p|⟨∇f(x),z⟩|pdz=μ(x)⟨∇f(x)⊗p,∫φ(∥z∥2)pz⊗pdz⟩.

Note the above inner product involves tensors of order , and recall that we have assumed that is even. Since the function depends only on the norm of in the above integral, the latter should also be isotropic. The precise statement is as follows:

###### Lemma A.1.

For any function and vector , we have

 ⟨u⊗p , ∫w(∥z∥2)z⊗pdz⟩ (15)

Applying Lemma A.1 with the function and then simplifying yields

 limN→∞1N2hp+dJp(f)=1dp/2∫∥z∥p2φ(∥z∥2)pdz⋅∫∥∇f(x)∥p2μ2(x)dx,

which concludes the proof of the theorem.

The only remaining detail is to prove Lemma A.1.

#### Proof of Lemma a.1

We proceed by induction on the integer . In the base case (), we have

 ⟨uu⊺ , ∫w(∥z∥2)zz⊺dz⟩=u⊺(∫w(∥z∥2)zz⊺dz)u.

Since the function depends only on , the matrix between the parentheses above is proportional to the identity, the proportionality constant can be determined by taking a trace. We end up with . This establishes the base case.555The case is also proven in Proposition 4.1 of the paper [BCH03] Now assume that for a given even , and for all function non-negative maps , one has (15). We prove that the same is true for . Define , and for any vector , let be the partial contraction of by , namely

 ¯Tp:= Tp+2(u⊗u)=∫w(∥z∥2)z⊗p⟨u,z⟩2dz.

The tensor is of order , and the map is non-negative, so by the induction hypothesis, for every , we have

 ⟨v⊗p , ¯Tp⟩=1dp/2(∫w(∥z∥2)∥z∥p2⟨u,z⟩2dz)∥v∥p2.

By recourse to the base case, the quadratic form between the parentheses is equal to
. Taking completes the proof of the lemma.

## Appendix B Proof of Theorem 3.2

Recall the shorthand notation . By convexity, the function is a minimizer of the functional if for all test functions and all sufficiently small real numbers , we have . Moreover, by a Taylor series expansion, we have

 Ip(f+ϵh)=Ip(f)+pϵ∫⟨∇f(x),∇h(x)⟩⋅∥∇f(x)∥p−22μ2(x)dx+O(ϵ2),

where the term is non-negative by convexity of . Hence, the function is a minimizer if and only if

 ∫⟨∇f(x),∇h(x)⟩⋅∥∇f