A Primal-Dual Convergence Analysis of Boosting

Matus Telgarsky. Department of Computer Science and Engineering, University of California, San Diego. Email: <mtelgars@cs.ucsd.edu>.
Abstract

Boosting combines weak learners into a predictor with low empirical risk. Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated. This manuscript studies this primal-dual relationship under a broad family of losses, including the exponential loss of AdaBoost and the logistic loss, revealing:

• Weak learnability aids the whole loss family: for any ε > 0, O(ln(1/ε)) iterations suffice to produce a predictor with empirical risk ε-close to the infimum;

• The circumstances granting the existence of an empirical risk minimizer may be characterized in terms of the primal and dual problems, yielding a new proof of the known rate O(ln(1/ε));

• Arbitrary instances may be decomposed into the above two, granting rate O(1/ε), with a matching lower bound provided for the logistic loss.

1 Introduction

Boosting is the task of converting inaccurate weak learners into a single accurate predictor. The existence of any such method was unknown until the breakthrough result of Schapire (1990): under a weak learning assumption, it is possible to combine many carefully chosen weak learners into a majority of majorities with arbitrarily low training error. Soon after, Freund (1995) noted that a single majority is enough, and that Θ(ln(1/ε)) iterations are both necessary and sufficient to attain accuracy ε. Finally, their combined effort produced AdaBoost, which exhibits this optimal convergence rate (under the weak learning assumption), and has an astonishingly simple implementation (Freund and Schapire, 1997).

It was eventually revealed that AdaBoost was minimizing a risk functional, specifically the exponential loss (Breiman, 1999). Aiming to alleviate perceived deficiencies in the algorithm, other loss functions were proposed, foremost amongst these being the logistic loss (Friedman et al., 2000). Given the wide practical success of boosting with the logistic loss, it is perhaps surprising that no convergence rate better than O(exp(1/ε²)) was known, even under the weak learning assumption (Bickel et al., 2006). The reason for this deficiency is simple: unlike SVM, least squares, and basically any other optimization problem considered in machine learning, there might not exist a choice of weights which attains the minimal risk! This reliance is carried over from convex optimization, where the assumption of attainability is generally made, either directly, or through stronger conditions like compact level sets or strong convexity (Luo and Tseng, 1992). But this limitation seems artificial: a function like x ↦ exp(−x) has no minimizer but decays rapidly.

Convergence rate analysis provides a valuable mechanism to compare and improve minimization algorithms. But there is a deeper significance with boosting: a convergence rate of O(1/ε) means that, with a combination of just O(1/ε) predictors, one can construct an ε-optimal classifier, which is crucial to both the computational efficiency and statistical stability of this predictor.

The main contribution of this manuscript is to provide a tight convergence theory for a large family of losses, including the exponential and logistic losses, which has heretofore resisted analysis. In particular, it is shown that the (disjoint) scenarios of weak learnability (Section 6.1) and attainability (Section 6.2) both exhibit the rate O(ln(1/ε)). These two scenarios are in a strong sense extremal, and general instances are shown to decompose into them; but their conflicting behavior yields a degraded rate O(1/ε) (Section 6.3). A matching lower bound for the logistic loss demonstrates this is no artifact.

1.1 Outline

Beyond providing these rates, this manuscript will study the rich ecology within the primal-dual interplay of boosting.

Starting with necessary background, Section 2 provides the standard view of boosting as coordinate descent of an empirical risk. This primal formulation of boosting obscures a key internal mechanism: boosting iteratively constructs distributions where the previously selected weak learner fails. This view is recovered in the dual problem; specifically, Section 3 reveals that the dual feasible set is the collection of distributions where all weak learners have no correlation to the target, and the dual objective is a max entropy rule.

The dual optimum is always attainable; since a standard mechanism in convergence analysis is to control the distance to the optimum, why not overcome the unattainability of the primal optimum by working in the dual? It turns out that the classical weak learning rate was a mechanism to control distances in the dual all along; by developing a suitable generalization (Section 4), it is possible to convert the improvement due to a single step of coordinate descent into a relevant distance in the dual (Section 6). Crucially, this holds for general instances, without any assumptions.

The final puzzle piece is to relate these dual distances to the optimality gap. Section 5 lays the foundation, taking a close look at the structure of the optimization problem. The classical scenarios of attainability and weak learnability are identifiable directly from the weak learning class and training sample; moreover, they can be entirely characterized by properties of the primal and dual problems.

Section 5 will also reveal another structure: there is a subset of the training set, the hard core, which is the maximal support of any distribution upon which every weak learner and the training labels are uncorrelated. This set is central—for instance, the dual optimum (regardless of the loss function) places positive weight on exactly the hard core. Weak learnability corresponds to the hard core being empty, and attainability corresponds to it being the whole training set. For those instances where the hard core is a nonempty proper subset of the training set, the behavior on and off the hard core mimics attainability and weak learnability, and Section 6.3 will leverage this to produce rates using facts derived for the two constituent scenarios.

Much of the technical material is relegated to the appendices. For convenience, Appendix A summarizes notation, and Appendix B contains some important supporting results. Of perhaps practical interest, Appendix D provides methods to select the step size, meaning the weight with which new weak learners are included in the full predictor. These methods are sufficiently powerful to grant the convergence rates in this manuscript.

1.2 Related Work

The development of general convergence rates has a number of important milestones in the past decade. Collins et al. (2002) proved convergence for a large family of losses, albeit without any rates. Interestingly, the step size there only partially modified the choice from AdaBoost to accommodate arbitrary losses, whereas the choice here follows standard optimization principles based purely on the particular loss. Next, Bickel et al. (2006) showed a general rate of O(exp(1/ε²)) for a slightly smaller family of functions: every loss has positive lower and upper bounds on its second derivative within any compact interval. This is a larger family than what is considered in the present manuscript, but Section 6.2 will discuss the role of the extra assumptions when producing fast rates.

Many extremely important cases have also been handled. The first is the original rate of O(ln(1/ε)) for the exponential loss under the weak learning assumption (Freund and Schapire, 1997). Next, under the assumption that the empirical risk minimizer is attainable, Rätsch et al. (2001) demonstrated the rate O(ln(1/ε)). The loss functions in that work must satisfy lower and upper bounds on the Hessian within the initial level set; equivalently, the existence of lower and upper bounding quadratic functions within this level set. This assumption may be slightly relaxed to needing just lower and upper second derivative bounds on the univariate loss function within an initial bounding interval (cf. discussion within Section 5.2), which is the same set of assumptions used by Bickel et al. (2006), and as discussed in Section 6.2, is all that is really needed by the analysis in the present manuscript under attainability.

Parallel to the present work, Mukherjee et al. (2011) established general convergence under the exponential loss, with a rate of O(1/ε). That work also presented bounds comparing the AdaBoost suboptimality to any bounded solution, which can be used to succinctly prove consistency properties of AdaBoost (Schapire and Freund, in preparation). In this case, the rate degrades to O(1/ε³), which although presented without lower bound, is not terribly surprising since the optimization problem minimized by boosting has no norm penalization. Finally, mirroring the development here, Mukherjee et al. (2011) used the same boosting instance (due to Schapire (2010)) to produce lower bounds, and also decomposed the boosting problem into finite and infinite margin pieces (cf. Section 5.3).

It is interesting to mention that, for many variants of boosting, general convergence rates were known. Specifically, once it was revealed that boosting is trying not only to be correct but also to have large margins (Schapire et al., 1997), much work was invested into methods which explicitly maximized the margin (Rätsch and Warmuth, 2002), or penalized variants focused on the inseparable case (Warmuth et al., 2007, Shalev-Shwartz and Singer, 2008). These methods generally impose some form of regularization (Shalev-Shwartz and Singer, 2008), which grants attainability of the risk minimizer, and allows standard techniques to grant general convergence rates. Interestingly, the guarantees in those works cited in this paragraph are O(ln(m)/ε²).

Hints of the dual problem may be found in many works, most notably those of Kivinen and Warmuth (1999) and Collins et al. (2002), which demonstrated that boosting is seeking a difficult distribution over training examples via iterated Bregman projections.

The notion of hard core sets is due to Impagliazzo (1995). A crucial difference is that in the present work, the hard core is unique, maximal, and every weak learner does no better than random guessing upon a family of distributions supported on this set; in this cited work, the hard core is relaxed to allow some small but constant fraction correlation to the target. This relaxation is central to the work, which provides a correspondence between the complexity (circuit size) of the weak learners, the difficulty of the target function, the size of the hard core, and the correlation permitted in the hard core.

2 Setup

A view of boosting, which pervades this manuscript, is that the action of the weak learning class upon the sample can be encoded as a matrix (Rätsch et al., 2001, Shalev-Shwartz and Singer, 2008). Let a sample S := {(xᵢ, yᵢ)}ᵢ₌₁ᵐ and a weak learning class H be given. For every h ∈ H, let S|_h denote the negated projection onto S induced by h; that is, S|_h is a vector of length m, with coordinates (S|_h)ᵢ = −yᵢh(xᵢ). If the set of all such columns is finite, collect them into the matrix A ∈ ℝ^{m×n}. Let aᵢ denote the ith row of A, corresponding to the example (xᵢ, yᵢ), and let {hⱼ}ⱼ₌₁ⁿ index the set of weak learners corresponding to columns of A. It is assumed, for convenience, that entries of A are within [−1, +1]; relaxing this assumption merely scales the presented rates by a constant.

The setting considered here is that this finite matrix A can be constructed. Note that this can encode infinite classes, so long as they map to only k distinct values (in which case A has at most kᵐ distinct columns). As another example, if the weak learners are binary, and H has VC dimension d, then Sauer's lemma grants that A has at most (m+1)^d columns. This matrix view of boosting is thus similar to the interpretation of boosting performing descent in functional space (Mason et al., 2000, Friedman et al., 2000), but the class complexity and finite sample have been used to reduce the function class to a finite object.
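To make this encoding concrete, here is a minimal Python sketch (the sample, the stump thresholds, and the function name are illustrative assumptions, not from the manuscript) of building a matrix with entries A[i][j] = −yᵢhⱼ(xᵢ):

```python
# Sketch: encode the action of a finite weak learning class on a sample
# as the matrix A, with entries A[i][j] = -y_i * h_j(x_i).
def build_matrix(sample, hypotheses):
    """sample: list of (x, y) pairs with y in {-1, +1};
    hypotheses: list of callables returning values in [-1, +1]."""
    return [[-y * h(x) for h in hypotheses] for (x, y) in sample]

# Toy instance: points on the real line, decision stumps as weak learners.
sample = [(-2.0, -1), (-0.5, -1), (1.0, 1), (3.0, 1)]
stumps = [lambda x, t=t: 1 if x > t else -1 for t in (-1.0, 0.0, 2.0)]
A = build_matrix(sample, stumps)
# Row i of A is a_i; column j is the negated projection of stump j onto the sample.
```

A negative entry A[i][j] means hⱼ classifies example i correctly, matching the sign convention above.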

To make the connection to boosting, the missing ingredient is the loss function.

Definition 2.0.

𝒢₀ is the set of loss functions g satisfying: g is twice continuously differentiable, g″ > 0 everywhere, and lim_{x→−∞} g(x) = 0.

For convenience, whenever g ∈ 𝒢₀ and sample size m are provided, let f denote the empirical risk function f(x) := Σᵢ₌₁ᵐ g((x)ᵢ). For more properties of 𝒢₀ and f, please see Appendix C.

The convergence rates of Section 6 will require a few more conditions, but 𝒢₀ suffices for all earlier results.

Example 2.0.

The exponential loss exp(·) (AdaBoost) and the logistic loss ln(1 + exp(·)) are both within 𝒢₀ (and the eventual refined family). These two losses appear in Figure 1, where the log-scale plot aims to convey their similarity for negative values.

This definition provides a notational break from most boosting literature, which instead takes losses to be decreasing (i.e., the exponential loss becomes x ↦ exp(−x)); note that the usage here simply pushes the negation into the definition of the matrix A. The significance of this modification is that the gradient of the empirical risk, which corresponds to distributions produced by boosting, is a nonnegative measure. (Otherwise, it would be necessary to negate this (nonpositive) distribution everywhere to match the boosting literature.) Note that there is no consensus on this choice, and the form followed here can be found elsewhere (Boucheron et al., 2005).

Boosting determines some weighting λ ∈ ℝⁿ of the columns of A, which correspond to weak learners in H. The (unnormalized) margin of example i is thus −⟨aᵢ, λ⟩ = −eᵢ⊤Aλ, where eᵢ is an indicator vector. (This negation is one notational inconvenience of making losses increasing.) Since the prediction on xᵢ is Σⱼ₌₁ⁿ (λ)ⱼhⱼ(xᵢ), it follows that Aλ < 0ₘ (where 0ₘ is the zero vector) implies a training error of zero. As such, boosting solves the minimization problem

 inf_{λ∈ℝⁿ} Σᵢ₌₁ᵐ g(⟨aᵢ, λ⟩) = inf_{λ∈ℝⁿ} Σᵢ₌₁ᵐ g(eᵢ⊤Aλ) = inf_{λ∈ℝⁿ} f(Aλ) = inf_{λ∈ℝⁿ} (f∘A)(λ) =: f̄_A;  (2.1)

recall f is the convenience function f(x) = Σᵢ₌₁ᵐ g((x)ᵢ), and in the present problem f∘A denotes the (unnormalized) empirical risk. f̄_A will denote the optimal objective value.

The infimum in (2.1) may well not be attainable. Suppose there exists λ′ such that Aλ′ < 0ₘ (Theorem 5.1 will show that this is equivalent to the weak learning assumption). Then

 0 ≤ inf_{λ∈ℝⁿ} f(Aλ) ≤ inf_{c>0} f(A(cλ′)) = 0.

On the other hand, for any λ ∈ ℝⁿ, f(Aλ) > 0. Thus the infimum is never attainable when weak learnability holds.
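This non-attainability is easy to observe numerically; the following sketch (toy single-column matrix and exponential loss, all numbers illustrative) shows a risk that is strictly positive everywhere yet vanishes along a fixed direction:

```python
import math

# One weak learner, correct on every example: the single column of A is negative,
# so f(A(c*lambda')) -> 0 as c grows, while f stays positive for every finite c.
A = [[-1.0], [-1.0], [-0.5]]

def risk(lam):
    # f(A lam) = sum_i exp(<a_i, lam>), the (unnormalized) exponential-loss risk
    return sum(math.exp(row[0] * lam) for row in A)

risks = [risk(c) for c in (0.0, 1.0, 10.0, 50.0)]
# risks is strictly decreasing toward 0, and every entry is positive.
```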

The template boosting algorithm appears in Figure 2, formulated in terms of A to make the connection to coordinate descent as clear as possible. To interpret the gradient terms, note that

 (∇(f∘A)(λ))ⱼ = (A⊤∇f(Aλ))ⱼ = −Σᵢ₌₁ᵐ g′(⟨aᵢ, λ⟩)hⱼ(xᵢ)yᵢ,

which is the expected negative correlation of hⱼ with the target labels according to an unnormalized distribution with weights g′(⟨aᵢ, λ⟩). The stopping condition ∇(f∘A)(λ) = 0ₙ means: either the distribution is degenerate (it is exactly zero), or every weak learner is uncorrelated with the target.

As such, Boost in Figure 2 represents an equivalent formulation of boosting, with one minor modification: the column (weak learner) selection has an absolute value. But note that this is the same as closing H under complementation (i.e., for any h ∈ H, there exists h′ ∈ H with h = −h′), which is assumed in many theoretical treatments of boosting.

In the case of the exponential loss and binary weak learners, the line search (when attainable) has a convenient closed form; but for other losses, and even with the exponential loss but with confidence-rated predictors, there may not be a closed form. As such, Boost only requires an approximate line search method. Appendix D details two mechanisms for this: an iterative method, which requires no knowledge of the loss function, and a closed form choice, which unfortunately requires some properties of the loss, which may be difficult to bound tightly. The iterative method provides a slightly worse guarantee, but is potentially more effective in practice; thus it will be used to produce all convergence rates in Section 6.

For simplicity, it is supposed that the best weak learner (or the approximation thereof encoded in A) can always be selected. Relaxing this condition is not without subtleties, but as discussed in Appendix E, there are ways to allow approximate selection without degrading the presented convergence rates.
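To make the coordinate-descent view concrete, here is a minimal Python sketch of the Boost template (not the manuscript's pseudocode: the toy matrix, the loss arguments, and the plain bisection line search standing in for the Appendix D step-size rules are all illustrative assumptions):

```python
import math

def boost(A, g, g_prime, rounds=200, search_iters=50):
    """Sketch of Boost: greedy coordinate descent on lambda -> sum_i g(<a_i, lambda>).
    The step size is found by a plain bisection line search (an assumption here,
    standing in for the paper's step-size rules)."""
    m, n = len(A), len(A[0])
    lam = [0.0] * n
    for _ in range(rounds):
        margins = [sum(A[i][j] * lam[j] for j in range(n)) for i in range(m)]
        w = [g_prime(z) for z in margins]           # unnormalized distribution over examples
        grad = [sum(w[i] * A[i][j] for i in range(m)) for j in range(n)]
        j = max(range(n), key=lambda k: abs(grad[k]))   # weak learner with largest |correlation|
        if abs(grad[j]) < 1e-12:                    # stopping condition: (near) zero gradient
            break
        direction = -1.0 if grad[j] > 0 else 1.0

        def dderiv(step):                           # directional derivative along +/- e_j
            return direction * sum(
                g_prime(margins[i] + step * direction * A[i][j]) * A[i][j]
                for i in range(m))

        hi = 1.0
        while dderiv(hi) < 0 and hi < 2.0 ** 30:    # bracket the 1-d minimizer
            hi *= 2.0
        lo = 0.0
        for _ in range(search_iters):               # bisection on the bracket
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if dderiv(mid) < 0 else (lo, mid)
        lam[j] += direction * lo
    return lam

# A hand-built attainable toy instance, run with the exponential loss.
A = [[-1.0, 0.0], [1.0, -0.5], [0.3, 1.0]]
lam = boost(A, math.exp, math.exp)
risk = sum(math.exp(sum(A[i][j] * lam[j] for j in range(2))) for i in range(3))
```

For this toy matrix the minimum of the empirical risk is roughly 2.8425, which the sketch approaches after a couple hundred rounds.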

As a final remark, consider the rows of −A as a collection of m points in ℝⁿ. Due to the form of g, Boost is therefore searching for a halfspace, parameterized by a vector λ, which contains all of these points. Sometimes such a halfspace may not exist, and g applies a smoothly increasing penalty to points that are farther and farther outside it.

3 Dual Problem

Applying coordinate descent to (2.1) represents a valid interpretation of boosting, in the sense that the resulting algorithm Boost is equivalent to the original. However this representation loses the intuitive operation of boosting as generating distributions where the current predictor is highly erroneous, and requesting weak learners accurate on these tricky distributions. The dual problem will capture this.

In addition to illuminating the structure of boosting, the dual problem also makes a major concrete contribution to the optimization behavior, and specifically to the convergence rates: the dual optimum is always attainable.

The dual problem will make use of Fenchel conjugates (Hiriart-Urruty and Lemaréchal, 2001, Borwein and Lewis, 2000); for any function h, the conjugate h* is

 h*(ϕ) = sup_{x∈dom(h)} ⟨x, ϕ⟩ − h(x).
Example 3.0.

The exponential loss has Fenchel conjugate

 (exp(·))*(ϕ) = ϕ ln(ϕ) − ϕ when ϕ > 0;  0 when ϕ = 0;  ∞ otherwise.

The logistic loss has Fenchel conjugate

 (ln(1 + exp(·)))*(ϕ) = (1 − ϕ)ln(1 − ϕ) + ϕ ln(ϕ) when ϕ ∈ (0, 1);  0 when ϕ ∈ {0, 1};  ∞ otherwise.

These conjugates are known respectively as the Boltzmann-Shannon and Fermi-Dirac entropies (see Borwein and Lewis, 2000, closing commentary, Section 3.3). Please see Figure 3 for a depiction.
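These closed forms can be sanity-checked against the definition of the conjugate; the following sketch (the grid bounds and tolerances are arbitrary choices, not from the manuscript) compares the Boltzmann-Shannon form of (exp(·))* with a crude numeric supremum:

```python
import math

def conj_numeric(g, phi, lo=-30.0, hi=30.0, steps=200001):
    """Crude grid approximation of g*(phi) = sup_x (x*phi - g(x))."""
    best = -float("inf")
    for k in range(steps):
        x = lo + (hi - lo) * k / (steps - 1)
        best = max(best, x * phi - g(x))
    return best

def exp_conj(phi):
    """Closed form of (exp(.))*: the Boltzmann-Shannon entropy."""
    if phi > 0:
        return phi * math.log(phi) - phi
    if phi == 0:
        return 0.0
    return float("inf")

# the grid supremum should match the closed form for a few positive phi
checks = [(phi, conj_numeric(math.exp, phi), exp_conj(phi)) for phi in (0.5, 1.0, 2.0)]
```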

It further turns out that general members of 𝒢₀ have a shape reminiscent of these two standard notions of entropy.

Lemma 3.0.

Let g ∈ 𝒢₀ be given. Then g* is continuously differentiable on the interior of its domain, strictly convex, and its domain is either [0, ∞) or [0, b] where b = lim_{x→∞} g′(x). Furthermore, g* has the following form:

 g*(ϕ) ∈ ∞ when ϕ < 0;  0 when ϕ = 0;  (−g(0), 0) when ϕ ∈ (0, g′(0));  −g(0) when ϕ = g′(0);  (−g(0), ∞] when ϕ > g′(0).

(The proof is in Appendix C.) There is one more object to present, the dual feasible set Φ_A.

Definition 3.0.

For any A ∈ ℝ^{m×n}, define the dual feasible set

 Φ_A := Ker(A⊤) ∩ ℝᵐ₊.

Consider any ψ ∈ Φ_A. Since A⊤ψ = 0ₙ, this is a weighting of examples which decorrelates all weak learners from the target: in particular, for any primal weighting λ over weak learners, ⟨Aλ, ψ⟩ = ⟨λ, A⊤ψ⟩ = 0. And since ψ ∈ ℝᵐ₊, all coordinates are nonnegative, so in the case that ψ ≠ 0ₘ, this vector may be renormalized into a distribution over examples. The case Φ_A = {0ₘ} is an extremely special degeneracy: it will be shown to encode the scenario of weak learnability.
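A tiny numeric sketch (hand-built matrix, purely illustrative) of this decorrelation property: any nonnegative ψ in the kernel of A⊤ makes the margins Aλ average to zero, whatever the primal weighting λ:

```python
# psi = (1, 1, 2) satisfies A^T psi = 0 and psi >= 0 for this hand-built matrix,
# so (after normalization) it is a distribution decorrelating both columns.
A = [[1.0, -1.0],
     [1.0, 1.0],
     [-1.0, 0.0]]
psi = [1.0, 1.0, 2.0]

At_psi = [sum(A[i][j] * psi[i] for i in range(3)) for j in range(2)]

def corr_under_psi(lam):
    # <A lam, psi> = <lam, A^T psi> = 0 for every lam
    return sum(psi[i] * sum(A[i][j] * lam[j] for j in range(2)) for i in range(3))
```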

Theorem 3.1.

For any g ∈ 𝒢₀ and A ∈ ℝ^{m×n} with f(x) = Σᵢ₌₁ᵐ g((x)ᵢ),

 inf{f(Aλ) : λ ∈ ℝⁿ} = sup{−f*(ψ) : ψ ∈ Φ_A},  (3.2)

where f*(ψ) = Σᵢ₌₁ᵐ g*((ψ)ᵢ). The right hand side is the dual problem, and moreover the dual optimum, denoted ψ_A^f, is unique and attainable.

(The proof uses routine techniques from convex analysis, and is deferred to Section G.2.)

The definition of Φ_A does not depend on any specific g ∈ 𝒢₀; this choice was made to provide general intuition on the structure of the problem for the entire family of losses. Note however that this will cause some problems later. For instance, with the logistic loss, the vector with every value two, i.e. 2·1ₘ, has objective value −∞. In a sense, there are points in Φ_A which are not really candidates for certain losses, and this fact will need adjustment in some convergence rate proofs.

Remark 3.2.

Finishing the connection to maximum entropy, for any g ∈ 𝒢₀, by Section 3, the optimum of the unconstrained problem inf_{ψ∈ℝᵐ₊} f*(ψ) is g′(0)·1ₘ, a rescaling of the uniform distribution. But note that ∇f(A0ₙ) = g′(0)·1ₘ: that is, the initial dual iterate is the unconstrained optimum! Let ϕₜ := ∇f(Aλₜ) denote the dual iterate; since ∇f*(ϕₜ) = Aλₜ (cf. Section B.2), then for any ψ ∈ Φ_A,

 ⟨∇f*(ϕₜ), ψ⟩ = ⟨Aλₜ, ψ⟩ = ⟨λₜ, A⊤ψ⟩ = 0.

This allows the dual optimum to be rewritten as

 ψ_A^f = argmin_{ψ∈Φ_A} f*(ψ) = argmin_{ψ∈Φ_A} f*(ψ) − f*(ϕₜ) − ⟨∇f*(ϕₜ), ψ − ϕₜ⟩;

that is, the dual optimum is the Bregman projection (according to f*) onto Φ_A of any dual iterate ϕₜ. In particular, ψ_A^f is the Bregman projection onto the feasible set of the unconstrained optimum g′(0)·1ₘ!

The connection to Bregman divergences runs deep; in fact, mirroring the development of Boost as “compiling out” the dual variables in the classical boosting presentation, it is possible to compile out the primal variables, producing an algorithm using only dual variables, meaning distributions over examples. This connection has been explored extensively (Kivinen and Warmuth, 1999, Collins et al., 2002).

Remark 3.3.

It may be tempting to use Theorem 3.1 to produce a stopping condition; that is, if for a supplied ε > 0, a primal iterate λₜ and dual feasible point ψ can be found satisfying f(Aλₜ) + f*(ψ) ≤ ε, Boost may terminate with the guarantee f(Aλₜ) − f̄_A ≤ ε.

Unfortunately, it is unclear how to produce dual iterates (excepting the trivial choice 0ₘ). If Ker(A⊤) can be computed, it suffices to project the dual iterates onto this subspace. In general however, not only is Ker(A⊤) painfully expensive to compute, this computation does not at all fit the oracle model of boosting, where access to A is obscured. (What is Ker(A⊤) when the weak learning oracle learns a size-bounded decision tree?)

In fact, noting that the primal-dual relationship from (3.2) can be written

 inf{f(Λ) : Λ ∈ Im(A)} = sup{−f*(Ψ) : Ψ ∈ Ker(A⊤) = Im(A)^⊥}

(since f* encodes the orthant constraint), the standard oracle model gives elements of Im(A), but what is needed in the dual is an oracle for Im(A)^⊥.

4 Generalized Weak Learning Rate

The weak learning rate was critical to the original convergence analysis of AdaBoost, providing a handle on the progress of the algorithm. But to be useful, this value must be positive, which was precisely the condition granted by the weak learning assumption. This section will generalize the weak learning rate into a quantity which can be made positive for any boosting instance.

Note briefly that this manuscript will differ slightly from the norm in that weak learning will be a purely sample-specific concept. That is, the concern here is convergence in empirical risk, and all that matters is the sample S, as encoded in A; it doesn't matter if there are wild points outside this sample, because the algorithm has no access to them.

This distinction has the following implication. The usual weak learning assumption states that there exists no uncorrelating distribution over the input space. This of course implies that any training sample used by the algorithm will also have this property; however, it suffices that there is no distribution over the input sample which uncorrelates the weak learners from the target.

Returning to task, the weak learning assumption posits the existence of a positive constant, the weak learning rate γ, which lower bounds the correlation of the best weak learner with the target for any distribution. Stated in terms of the matrix A,

 0 < γ = inf_{ϕ∈ℝᵐ₊, ‖ϕ‖₁=1} max_{j∈[n]} |Σᵢ₌₁ᵐ (ϕ)ᵢyᵢhⱼ(xᵢ)| = inf_{ϕ∈ℝᵐ₊∖{0ₘ}} ‖A⊤ϕ‖∞/‖ϕ‖₁ = inf_{ϕ∈ℝᵐ₊∖{0ₘ}} ‖A⊤ϕ‖∞/‖ϕ − 0ₘ‖₁.  (4.1)
Proposition 4.1.

A boosting instance is weak learnable (i.e., γ > 0) iff Φ_A = {0ₘ}.

Proof.

Suppose γ = 0; since the first infimum in (4.1) is of a continuous function over a compact set, it has some minimizer ϕ′. But ‖A⊤ϕ′‖∞ = 0, meaning ϕ′ ∈ Ker(A⊤), and so ϕ′ ∈ Φ_A ∖ {0ₘ}. On the other hand, if Φ_A ≠ {0ₘ}, take any ϕ″ ∈ Φ_A ∖ {0ₘ}; then

 0 ≤ γ = inf_{ϕ∈ℝᵐ₊∖{0ₘ}} ‖A⊤ϕ‖∞/‖ϕ‖₁ ≤ ‖A⊤ϕ″‖∞/‖ϕ″‖₁ = 0. ∎
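For m = 2 the first infimum in (4.1) ranges over a one-dimensional simplex, so it can be approximated by a grid; this sketch (instances hand-built for illustration) exhibits a positive rate for a weakly learnable matrix and a zero rate for one with a nonzero dual feasible point:

```python
def gamma_grid(A, steps=100001):
    """Grid approximation of (4.1) for m = 2: minimize the best weak learner's
    absolute correlation max_j |(A^T phi)_j| over phi = (t, 1-t), t in [0, 1]."""
    n = len(A[0])
    best = float("inf")
    for k in range(steps):
        t = k / (steps - 1)
        corr = max(abs(A[0][j] * t + A[1][j] * (1.0 - t)) for j in range(n))
        best = min(best, corr)
    return best

separable = [[-1.0, -0.5], [-1.0, 1.0]]  # first weak learner correct on both examples
degenerate = [[1.0], [-1.0]]             # phi = (1/2, 1/2) uncorrelates the only learner
g_sep = gamma_grid(separable)            # approximately 1: weak learnable
g_deg = gamma_grid(degenerate)           # 0: a nonzero element of Ker(A^T) with phi >= 0
```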

Following this connection, the first way in which the weak learning rate is modified is to replace the reference point 0ₘ with the dual feasible set Φ_A. For reasons that will be sketched shortly, but fully dealt with only in Section 6, it is also necessary to replace ℝᵐ₊ with a more refined choice S.

Definition 4.1.

Given a matrix A ∈ ℝ^{m×n} and a set S ⊆ ℝᵐ, define

 γ(A, S) := inf{ ‖A⊤ϕ‖∞ / inf_{ψ∈S∩Ker(A⊤)} ‖ϕ − ψ‖₁ : ϕ ∈ S ∖ Ker(A⊤) }.

First note that in the scenario of weak learnability (i.e., Φ_A = {0ₘ} by Section 4), the choice S = ℝᵐ₊ allows the new notion to exactly cover the old one: γ = γ(A, ℝᵐ₊).

To get a better handle on the meaning of γ(A, S), first define the following projection and distance notation to a closed convex nonempty set C, where in the case of non-uniqueness (p = 1 and p = ∞), some arbitrary choice is made:

 P^p_C(x) ∈ Argmin_{y∈C} ‖y − x‖_p,  D^p_C(x) = ‖x − P^p_C(x)‖_p.

Suppose, for some t, that ∇f(Aλₜ) ∈ S ∖ Ker(A⊤); then the infimum within γ(A, S) may be instantiated with ∇f(Aλₜ), yielding

 γ(A, S) = inf_{ϕ∈S∖Ker(A⊤)} ‖A⊤ϕ‖∞ / ‖ϕ − P¹_{S∩Ker(A⊤)}(ϕ)‖₁ ≤ ‖A⊤∇f(Aλₜ)‖∞ / ‖∇f(Aλₜ) − P¹_{S∩Ker(A⊤)}(∇f(Aλₜ))‖₁.  (4.2)

Rearranging this,

 γ(A, S) ‖∇f(Aλₜ) − P¹_{S∩Ker(A⊤)}(∇f(Aλₜ))‖₁ ≤ ‖A⊤∇f(Aλₜ)‖∞.  (4.3)

This is helpful because the right hand side appears in standard guarantees for single-step progress in descent methods. Meanwhile, the left hand side has reduced the influence of A to a single number γ(A, S), and the normed expression is the distance to a restriction of the dual feasible set, which will converge to zero if the infimum is to be approached, so long as this restriction contains the dual optimum.

This will be exactly the approach taken in this manuscript; indeed, the first step towards convergence rates, Section 6, will use exactly the upper bound in (4.3). The detailed work that remains is then dealing with the distance to the dual feasible set. The choice of S will be made to facilitate the production of these bounds, and will depend on the optimization structure revealed in Section 5.

In order for these expressions to mean anything, γ(A, S) must be positive.

Theorem 4.4.

Let a matrix A ≠ 0_{m×n} and a polyhedron S be given with S ∩ Ker(A⊤) ≠ ∅ and S ∖ Ker(A⊤) ≠ ∅. Then γ(A, S) > 0.

The proof, material on other generalizations of the weak learning rate, and discussion on the polyhedrality of Φ_A can all be found in Appendix F.

As a final connection, since A⊤ψ = 0ₙ for any ψ ∈ S ∩ Ker(A⊤), note that

 γ(A, S) ‖ϕ − ψ‖₁ ≤ ‖A⊤ϕ‖∞ = ‖A⊤(ϕ − ψ)‖∞ for ψ = P¹_{S∩Ker(A⊤)}(ϕ).

In this way, γ(A, S) resembles a Lipschitz constant, reflecting the effect of A⊤ on elements of the dual, relative to the dual feasible set.

5 Optimization Structure

The scenario of weak learnability translates into a simple condition on the dual feasible set: the dual feasible set is the origin (in symbols, Φ_A = {0ₘ}). And how about attainability—is there a simple way to encode this property in terms of the optimization problem?

This section will identify the structure of the boosting optimization problem both in terms of the primal and dual problems, first studying the scenarios of weak learnability and attainability, and then showing that general instances can be decomposed into these two.

There is another behavior which will emerge through this study, motivated by the following question. The dual feasible set is the set of nonnegative weightings of examples under which every weak learner (every column of A) has zero correlation; what is the support of these weightings?

Definition 5.0.

H(A) denotes the hard core of A: the collection of examples which receive positive weight under some dual feasible point, a distribution upon which no weak learner is correlated with the target. Symbolically,

 H(A) := {i ∈ [m] : ∃ψ ∈ Φ_A, (ψ)ᵢ > 0}.
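A small illustration (hand-built matrix; the feasibility check is a plain numerical test, not the manuscript's machinery): below, the second column forces every dual feasible ψ to place zero weight on example 3, so the hard core is the proper nonempty subset consisting of the first two examples:

```python
A = [[1.0, 0.0],
     [-1.0, 0.0],
     [0.0, -1.0]]

def is_dual_feasible(psi, A, tol=1e-12):
    """Numerically check psi in Ker(A^T) intersected with the nonnegative orthant."""
    m, n = len(A), len(A[0])
    return (all(p >= 0.0 for p in psi)
            and all(abs(sum(A[i][j] * psi[i] for i in range(m))) <= tol
                    for j in range(n)))

psi = [1.0, 1.0, 0.0]   # feasible: the first two examples balance in the first column
support = [i for i in range(3) if psi[i] > 0]
# Any feasible psi has psi[2] = 0 (the second column reads -psi[2] = 0), so
# example 3 is outside the hard core, while examples 1 and 2 are inside it.
```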

One case has already been considered; as established in Section 4, weak learnability is equivalent to Φ_A = {0ₘ}, which in turn is equivalent to H(A) = ∅. But it will turn out that other possibilities for H(A) also have direct relevance to the behavior of Boost. Indeed, contrasted with the primal and dual problems and feasible sets, H(A) will provide a conceptually simple, discrete object with which to comprehend the behavior of boosting.

5.1 Weak Learnability

The following theorem establishes four equivalent formulations of weak learnability.

Theorem 5.1.

For any A ∈ ℝ^{m×n} and g ∈ 𝒢₀, the following conditions are equivalent:

1. there exists λ ∈ ℝⁿ with Aλ < 0ₘ,

2. inf_λ f(Aλ) = 0,

3. ψ_A^f = 0ₘ,

4. Φ_A = {0ₘ}.

First note that (5.1.4) indicates (via Section 4) this is indeed the weak learnability setting, equivalently γ > 0.

Recall the earlier discussion of boosting as searching for a halfspace containing the points {−aᵢ}ᵢ₌₁ᵐ; property (5.1.1) encodes precisely this statement, and moreover that there exists such a halfspace with these points interior to it. Note that this statement also encodes the margin separability equivalence of weak learnability due to Shalev-Shwartz and Singer (2008); specifically, if labels are bounded away from 0 and each point (row of A) is rescaled appropriately by its label, the definition of A grants that positive examples will land on one side of the hyperplane, and negative examples on the other.

The two properties (5.1.4) and (5.1.1) can be interpreted geometrically, as depicted in Figure 4: the dual feasibility statement is that no convex combination of the points {aᵢ}ᵢ₌₁ᵐ will contain the origin.

Next, (5.1.2) is the (error part of the) usual strong PAC guarantee (Schapire, 1990): weak learnability entails that the training error will go to zero. And, as must be the case when f̄_A = 0, property (5.1.3) provides that −f*(ψ_A^f) = −f*(0ₘ) = 0.

Proof of Theorem 5.1.

(5.1.1 ⇒ 5.1.2) Let λ̄ be given with Aλ̄ < 0ₘ, and let any increasing sequence {cᵢ}ᵢ₌₁^∞ ↑ ∞ be given. Then, since g > 0 and lim_{x→−∞} g(x) = 0,

 inf_λ f(Aλ) ≤ lim_{i→∞} f(cᵢAλ̄) = 0 ≤ inf_λ f(Aλ).

(5.1.2 ⇒ 5.1.3) The point 0ₘ is always dual feasible, and

 inf_λ f(Aλ) = 0 = −f*(0ₘ).

Since the dual optimum is unique (Theorem 3.1), ψ_A^f = 0ₘ.

(5.1.3 ⇒ 5.1.4) Suppose there exists ψ ∈ Φ_A with ψ ≠ 0ₘ. Since −f* is continuous and increasing along every positive direction at 0ₘ (see Section 3 and Appendix C), there must exist some tiny c > 0 such that −f*(cψ) > −f*(0ₘ), contradicting the selection of ψ_A^f = 0ₘ as the unique optimum.

(5.1.4 ⇒ 5.1.1) This case is directly handled by Gordan's theorem (cf. Theorem B.1). ∎

5.2 Attainability

For strictly convex functions, there is a nice characterization of attainability, which will require the following definition.

Definition 5.1 (cf. Hiriart-Urruty and Lemaréchal (2001, Definition B.3.2.5)).

A closed convex function h is called 0-coercive when all level sets are compact. (That is, for any α ∈ ℝ, the set {x ∈ dom(h) : h(x) ≤ α} is compact.)

Proposition 5.1.

Suppose h : ℝᵐ → ℝ is differentiable and strictly convex. Then inf_x h(x) is attainable iff h is 0-coercive.

Note that 0-coercivity means the domain of the infimum in (2.1) can be restricted to a compact set, and attainability in turn follows just from properties of minimization of continuous functions on compact sets. It is the converse which requires some structure; the proof however is unilluminating and deferred to Section G.3.

Armed with this notion, it is now possible to build an attainability theory for 𝒢₀. Some care must be taken with the above concepts, however; note that while g is strictly convex, f∘A need not be (for instance, if there exist nonzero elements of Ker(A), then moving along these directions does not change the objective value). Therefore, 0-coercivity statements will refer to the function

 (f + ι_{Im(A)})(x) = f(x) when x ∈ Im(A);  ∞ otherwise.

This function is effectively taking the epigraph of f, and intersecting it with a slice representing Im(A), the set of points considered by the algorithm. As such, it is merely a convenient way of dealing with Ker(A) as discussed above.

Theorem 5.2.

For any A ∈ ℝ^{m×n} and g ∈ 𝒢₀, the following conditions are equivalent:

1. Im(A) ∩ ℝᵐ₋ = {0ₘ},

2. f + ι_{Im(A)} is 0-coercive,

3. ψ_A^f ∈ ℝᵐ₊₊,

4. H(A) = [m].

Following the discussion above, (5.2.2) is the desired attainability statement.

Next, note that (5.2.4) is equivalent to the expression Φ_A ∩ ℝᵐ₊₊ ≠ ∅, i.e. there exists a distribution with positive weight on all examples, upon which every weak learner is uncorrelated. The forward direction is direct from the existence of a single ψ ∈ Φ_A ∩ ℝᵐ₊₊. For the converse, note that the dual feasible points corresponding to each i ∈ H(A) can be combined by summation into an element of Φ_A ∩ ℝᵐ₊₊ (since Ker(A⊤) is a subspace).

For a geometric interpretation, consider (5.2.1) and (5.2.4). The first says that any halfspace containing some point −aᵢ within its interior must also fail to contain some other point −aⱼ (with j ≠ i). (Property (5.2.1) also allows for the scenario that no valid enclosing halfspace exists at all.) The latter states that the origin is contained within a positive convex combination of the points {aᵢ}ᵢ₌₁ᵐ (alternatively, the origin is within the relative interior of the convex hull of these points). These two scenarios appear in Figure 5.

Finally, note (5.2.3): it is not only the case that there are dual feasible points fully interior to ℝᵐ₊, but furthermore the dual optimum is also interior. This will be crucial in the convergence rate analysis, since it will allow the dual iterates to never be too small.

Proof of Theorem 5.2.

(5.2.1 ⇒ 5.2.2) Let λ ∈ ℝⁿ and d ∈ ℝᵐ ∖ {0ₘ} be arbitrary. To show 0-coercivity, it suffices (Hiriart-Urruty and Lemaréchal, 2001, Proposition B.3.2.4.iii) to show

 lim_{t→∞} [f(Aλ + td) + ι_{Im(A)}(Aλ + td) − f(Aλ)] / t > 0.  (5.3)

If d ∉ Im(A) (and t > 0), then the limit is ∞. Suppose instead d ∈ Im(A) ∖ {0ₘ}; by (5.2.1), since d ≠ 0ₘ, then d ∉ ℝᵐ₋, meaning there is at least one positive coordinate dⱼ > 0. But then, since g > 0 and g is convex,

 (5.3) ≥ lim_{t→∞} [g(eⱼ⊤(Aλ + td)) − f(Aλ)] / t ≥ lim_{t→∞} [g(eⱼ⊤Aλ) + t dⱼ g′(eⱼ⊤Aλ) − f(Aλ)] / t = dⱼ g′(eⱼ⊤Aλ),

which is positive by the selection of d_j > 0 and since g′ > 0 everywhere.

(2 ⟹ 3) Since the infimum is attainable, designate any λ̄ satisfying f(Aλ̄) = inf_λ f(Aλ) (note, although g is strictly convex, λ ↦ f(Aλ) need not be, thus uniqueness is not guaranteed!). The optimality conditions of Fenchel problems may be applied, meaning ψ^f_A = ∇f(Aλ̄), which is interior to R^m_+ since g′ > 0 everywhere (cf. Appendix C). (For the optimality conditions, see Borwein and Lewis (2000, Exercise 3.3.9.f), with a negation inserted to match the negation inserted within the proof of Theorem 3.1.)

(3 ⟹ 4) This holds since ψ^f_A ∈ Φ_A and, by (5.2.3), ψ^f_A ∈ int(R^m_+).

(4 ⟹ 1) This case is directly handled by Stiemke’s Theorem (cf. Theorem B.4). ∎
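For reference, the form of Stiemke's transposition theorem used in this last implication may be sketched as follows (a standard statement, oriented to the present setting rather than copied from Appendix B):

```latex
% Stiemke's theorem: for any $A \in \mathbb{R}^{m \times n}$,
% exactly one of the following two systems is solvable.
\begin{itemize}
  \item[(I)]  $\exists \psi \in \mathbb{R}^m_{++}$ with $A^\top \psi = 0_n$;
  \item[(II)] $\exists \lambda \in \mathbb{R}^n$ with
              $A\lambda \leq 0_m$ and $A\lambda \neq 0_m$.
\end{itemize}
```

Property (5.2.4) exhibits a solution to system (I), hence system (II) is unsolvable, which is exactly property (5.2.1).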

5.3 General Setting

So far, the scenarios of weak learnability and attainability corresponded to the extremal cases of the hard core: empty, or the whole training set. The situation in the general setting is basically as good as one could hope for: it interpolates between the two extremal cases.

As a first step, partition A into two submatrices according to the hard core.

Definition 5.3.

Partition A by rows into two matrices A0 ∈ R^{z×n} and A+ ∈ R^{p×n}, where A+ has the rows corresponding to the hard core, and A0 the remaining rows. For convenience, permute the examples so that

 A = [ A0
       A+ ].

(This merely relabels the coordinate axes, and does not change the optimization problem.) Note that this decomposition is unique, since the hard core is uniquely specified.
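As a tiny illustration (an instance invented here, not one from the original text), consider three examples and two weak learners, with the rows already permuted as above. The last two rows conflict with one another, while the first can be driven arbitrarily negative:

```latex
% A hypothetical instance with $m = 3$, $n = 2$:
A = \begin{bmatrix} 0 & -1 \\ 1 & 0 \\ -1 & 0 \end{bmatrix},
\qquad
\mathrm{Ker}(A^\top) \cap \mathbb{R}^3_+ = \{\, t\,(0, 1, 1) : t \geq 0 \,\}.
% Only the last two examples receive positive weight from some
% nonnegative element of $\mathrm{Ker}(A^\top)$, so
A_0 = \begin{bmatrix} 0 & -1 \end{bmatrix},
\qquad
A_+ = \begin{bmatrix} 1 & 0 \\ -1 & 0 \end{bmatrix}.
```

Indeed, λ = (0, t) with t → ∞ sends the first coordinate of Aλ to −∞ while the other two stay at zero, so only the last two examples belong to the hard core.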

As a first consequence, this partition cleanly decomposes the dual feasible set into Φ_{A0} = {0^z} and Φ_{A+}.

Proposition 5.3.

For any A ∈ R^{m×n} and g ∈ G, Φ_{A0} = {0^z}, Φ_{A+} ∩ int(R^p_+) ≠ ∅, and

 Φ_A = Φ_{A0} × Φ_{A+}.

Furthermore, no other partition of A into B0 and B+ satisfies these properties.

Proof.

It must hold that Φ_{A0} = {0^z}, since otherwise there would exist ψ_z ∈ Φ_{A0} with some coordinate (ψ_z)_i > 0, which could be extended to (ψ_z, 0^p) ∈ Φ_A, and the example behind the positive coordinate could be added to A+, contradicting the construction of A+ as including all such rows.

The property Φ_{A+} ∩ int(R^p_+) ≠ ∅ was proved in the discussion of Theorem 5.2: simply add together, for each example i within A+, the feasible ψ’s corresponding to positive weight on i.

For the decomposition, note first that certainly every ψ ∈ Φ_{A0} × Φ_{A+} satisfies ψ ∈ Φ_A. Now suppose contradictorily that there exists (ψ_z, ψ_p) ∈ Φ_A ∖ (Φ_{A0} × Φ_{A+}). There must exist a coordinate i with (ψ_z)_i > 0, since otherwise ψ_z = 0^z and ψ_p ∈ Φ_{A+}; but that means example i should have been included in A+, a contradiction.

For the uniqueness property, suppose some other partition (B0, B+) is given, satisfying the desired properties. It is impossible that some row of B+ is not in A+, since any ψ_p ∈ Φ_{B+} ∩ int(R^{p′}_+) can be extended to an element of Φ_A with positive weight on that row, and thus the row is included in A+ by definition. But the other case, a row of A+ that is not in B+, is equally untenable, since the corresponding measure (placing positive weight on that row) is in Φ_A but not in Φ_{B0} × Φ_{B+}. ∎

The main result of this section will have the same two main ingredients as Proposition 5.3:

• The full boosting instance may be uniquely decomposed into two pieces, A0 and A+, which individually behave like the weak learnability and attainability scenarios, respectively.

• The subinstances have a somewhat independent effect on the full instance.

Theorem 5.4.

Let A ∈ R^{m×n} and g ∈ G be given. Let B0 ∈ R^{z×n}, B+ ∈ R^{p×n} be any partition of A by rows. The following conditions are equivalent:

1. there exists λ ∈ R^n with B0λ < 0^z and B+λ = 0^p, and every λ ∈ R^n with B+λ ≤ 0^p has B+λ = 0^p,

2. inf_λ f(B0λ) = 0, and inf_λ f(Aλ) = inf_λ f(B+λ),
and f + ι_{Im(B+)} is 0-coercive,

3. ψ^f_A = (ψ^f_{B0}, ψ^f_{B+}) with ψ^f_{B0} = 0^z and ψ^f_{B+} ∈ int(R^p_+),

4. Φ_{B0} = {0^z}, and Φ_{B+} ∩ int(R^p_+) ≠ ∅, and Φ_A = Φ_{B0} × Φ_{B+}.

Stepping through these properties, notice that (5.4.4) mirrors the statement of Proposition 5.3. But that proposition also granted that this representation was unique, thus only one partition of A satisfies the above properties, namely (A0, A+). Since this Theorem is stated as a series of equivalences, any one of these properties can in turn be used to identify the hard core.

To continue with geometric interpretations, notice that (5.4.1) states that there exists a halfspace strictly containing those points in B0, with all points of B+ on its boundary; furthermore, trying to adjust this halfspace to contain elements of B+ will place others outside it. With regards to the geometry of the dual feasible set as provided by (5.4.4), the origin is within the relative interior of the points corresponding to B+, however the convex hull of the points of B0 can not contain the origin. Furthermore, if the origin is written as a convex combination of all points, this combination must place zero weight on the points of B0. This scenario is depicted in Figure 6.

In properties (5.4.2) and (5.4.3), B0 mirrors the behavior of weakly learnable instances in Theorem 5.1, and B+ analogously follows instances with minimizers from Theorem 5.2. The interesting addition, as discussed above, is the independence of these components: (5.4.2) provides that the infimum of the combined problem is the sum of the infima of the subproblems, while (5.4.3) provides that the full dual optimum may be obtained by concatenating the subproblems’ dual optima.
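As a hedged worked example of (5.4.2) (with the exponential loss g(x) = eˣ and an instance invented here for illustration), the infimum decomposition can be checked by hand:

```latex
% Take $g(x) = e^x$ and the instance
B_0 = \begin{bmatrix} 0 & -1 \end{bmatrix},
\qquad
B_+ = \begin{bmatrix} 1 & 0 \\ -1 & 0 \end{bmatrix},
\qquad
A = \begin{bmatrix} B_0 \\ B_+ \end{bmatrix}.
% Writing $\lambda = (\lambda_1, \lambda_2)$,
f(A\lambda) = e^{-\lambda_2} + e^{\lambda_1} + e^{-\lambda_1},
% whence
\inf_\lambda f(B_0\lambda) = \inf_{\lambda_2} e^{-\lambda_2} = 0
\ \ (\text{not attained}),
\qquad
\inf_\lambda f(B_+\lambda)
  = \inf_{\lambda_1} \bigl( e^{\lambda_1} + e^{-\lambda_1} \bigr) = 2
\ \ (\text{attained at } \lambda_1 = 0),
% and indeed
\inf_\lambda f(A\lambda) = 0 + 2 = \inf_\lambda f(B_+\lambda).
```

Along the minimizing sequence λ = (0, t) with t → ∞, the gradient of f converges to (0, 1, 1), matching the concatenated form of the dual optimum in (5.4.3).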

Proof of Theorem 5.4.

(1 ⟹ 2) Let λ̄ be given with B0λ̄ < 0^z and B+λ̄ = 0^p, and let {c_i}_{i≥1} be an arbitrary sequence increasing without bound. Lastly, let {λ_i}_{i≥1} be a minimizing sequence for inf_λ f(B+λ). Then

 inf_λ f(B+λ) = lim_{i→∞} [f(B+λ_i) + f(c_i·B0λ̄)]
   ≥ inf_λ f(Aλ)
   = inf_λ [f(B0λ) + f(B+λ)]
   ≥ inf_λ f(B+λ),

which used the fact that f(c_i·B0λ̄) → 0 since B0λ̄ ∈ R^z_{−−}. And since the chain of inequalities starts and ends the same, it must be a chain of equalities, which means inf_λ f(Aλ) = inf_λ f(B+λ). (That inf_λ f(B0λ) = 0 also follows from B0λ̄ ∈ R^z_{−−}, via Theorem 5.1.) To show 0-coercivity of f + ι_{Im(B+)}, note the second part of (5.4.1) is one of the conditions of Theorem 5.2.

(2 ⟹ 3) First, by Theorem 5.1, inf_λ f(B0λ) = 0 means Φ_{B0} = {0^z} and ψ^f_{B0} = 0^z. Thus

 −f*(ψ^f_A) = sup_{ψ ∈ Φ_A} −f*(ψ)
   = sup{ −f*(ψ_z) − f*(ψ_p) : ψ_z ∈ R^z_+, ψ_p ∈ R^p_+, B0^⊤ψ_z + B+^⊤ψ_p = 0^n }
   ≥ sup_{ψ_z ∈ Φ_{B0}} −f*(ψ_z) + sup_{ψ_p ∈ Φ_{B+}} −f*(ψ_p)
   = 0 − f*(ψ^f_{B+}) = inf_{λ∈R^n} f(B+λ) = inf_{λ∈R^n} f(Aλ) = −f*(ψ^f_A),

Combining this with (0^z, ψ^f_{B+}) ∈ Φ_A and the optimality of ψ^f_A (cf. Theorem 3.1 and Section 3), −f*((0^z, ψ^f_{B+})) = −f*(ψ^f_A). But Theorem 3.1 shows ψ^f_A was unique, which gives the result. And to obtain ψ^f_{B+} ∈ int(R^p_+), use Theorem 5.2 with the 0-coercivity of f + ι_{Im(B+)}.

(3 ⟹ 4) Since ψ^f_{B0} = 0^z, it follows by Theorem 5.1 that Φ_{B0} = {0^z}. Furthermore, since ψ^f_{B+} ∈ int(R^p_+), it follows that Φ_{B+} ∩ int(R^p_+) ≠ ∅. Now suppose contradictorily that Φ_A ≠ Φ_{B0} × Φ_{B+}; since it always holds that Φ_{B0} × Φ_{B+} ⊆ Φ_A, this supposition grants the existence of (ψ_z, ψ_p) ∈ Φ_A where ψ_z ≠ 0^z.

Consider the element