# On Structured Prediction Theory with Calibrated Convex Surrogate Losses

Anton Osokin (INRIA/ENS, Paris, France; HSE, Moscow, Russia), Francis Bach (INRIA/ENS, Paris, France), Simon Lacoste-Julien (MILA and DIRO, Université de Montréal, Canada)

INRIA/ENS: DI École normale supérieure, CNRS, PSL Research University, Paris, France. HSE: National Research University Higher School of Economics, Moscow, Russia.

###### Abstract

We provide novel theoretical insights on structured prediction in the context of efficient convex surrogate loss minimization with consistency guarantees. For any task loss, we construct a convex surrogate that can be optimized via stochastic gradient descent and we prove tight bounds on the so-called “calibration function” relating the excess surrogate risk to the actual risk. In contrast to prior related work, we carefully monitor the effect of the exponential number of classes in the learning guarantees as well as on the optimization complexity. As an interesting consequence, we formalize the intuition that some task losses make learning harder than others, and that the classical 0-1 loss is ill-suited for structured prediction.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

## 1 Introduction

Structured prediction is a subfield of machine learning aiming at making multiple interrelated predictions simultaneously. The desired outputs (labels) are typically organized in some structured object such as a sequence, a graph, an image, etc. Tasks of this type appear in many practical domains such as computer vision (Nowozin & Lampert, 2011), natural language processing (Smith, 2011) and bioinformatics (Durbin et al., 1998).

The structured prediction setup has at least two typical properties differentiating it from the classical binary classification problems extensively studied in learning theory:

1. Exponential number of classes: this brings both additional computational and statistical challenges. By exponential, we mean exponentially large in the size of the natural dimension of output, e.g., the number of all possible sequences is exponential w.r.t. the sequence length.

2. Cost-sensitive learning: in typical applications, prediction mistakes are not all equally costly. The prediction error is usually measured with a highly-structured task-specific loss function, e.g., Hamming distance between sequences of multi-label variables or mean average precision for ranking.

Despite many algorithmic advances to tackle structured prediction problems (Bakir et al., 2007; Nowozin et al., 2014), relatively few papers have been devoted to its theoretical understanding. Notable recent exceptions that made significant progress include Cortes et al. (2016) and London et al. (2016) (see references therein), which proposed data-dependent generalization error bounds in terms of popular empirical convex surrogate losses such as the structured hinge loss (Taskar et al., 2003, 2005; Tsochantaridis et al., 2005). A question not addressed by these works is whether their algorithms are consistent: does minimizing their convex bounds with infinite data lead to the minimization of the task loss as well? Alternatively, the structured probit and ramp losses are consistent (McAllester, 2007; McAllester & Keshet, 2011), but non-convex, and thus it is hard to obtain computational guarantees for them. In this paper, we aim at getting the property of consistency for surrogate losses that can be efficiently minimized with guarantees, and thus we consider convex surrogate losses.

The consistency of convex surrogates is well understood in the case of binary classification (Zhang, 2004b; Bartlett et al., 2006; Steinwart, 2007) and there is significant progress in the case of multi-class 0-1 loss (Zhang, 2004a; Tewari & Bartlett, 2007) and general multi-class loss functions (Ávila Pires et al., 2013; Ramaswamy & Agarwal, 2016; Williamson et al., 2016). A large body of work specifically focuses on the related tasks of ranking (Duchi et al., 2010; Calauzènes et al., 2012; Ramaswamy et al., 2013) and ordinal regression (Pedregosa et al., 2017).

Contributions. In this paper, we study consistent convex surrogate losses specifically in the context of an exponential number of classes. We argue that even while being consistent, a convex surrogate might not allow efficient learning. As a concrete example, Ciliberto et al. (2016) recently proposed a consistent approach to structured prediction, but the constant in their generalization error bound can be exponentially large as we explain in Section 5. There are two possible sources of difficulties from the optimization perspective: to reach adequate accuracy on the task loss, one might need to optimize a surrogate loss to exponentially small accuracy; or to reach adequate accuracy on the surrogate loss, one might need an exponential number of algorithm steps because of exponentially large constants in the convergence rate. We propose a theoretical framework that jointly tackles these two aspects and allows one to judge the feasibility of efficient learning. In particular, we construct a calibration function (Steinwart, 2007), i.e., a function setting the relationship between accuracy on the surrogate and task losses, and normalize it by means of the convergence rate of an optimization algorithm.

Aiming for the simplest possible application of our framework, we propose a family of convex surrogates that are consistent for any given task loss and can be optimized using stochastic gradient descent. For a special case of our family (quadratic surrogate), we provide a complete analysis including general lower and upper bounds on the calibration function for any task loss, with exact values for the 0-1, block 0-1 and Hamming losses. We observe that to have a tractable learning algorithm, one needs both a structured loss (not the 0-1 loss) and appropriate constraints on the predictor, e.g., in the form of linear constraints for the score vector functions. Our framework also indicates that in some cases it might be beneficial to use non-consistent surrogates. In particular, a non-consistent surrogate might allow optimization only up to specific accuracy, but exponentially faster than a consistent one.

We introduce the structured prediction setting suitable for studying consistency in Sections 2 and 3. We analyze the calibration function for the quadratic surrogate loss in Section 4. We review the related works in Section 5 and conclude in Section 6.

## 2 Structured prediction setup

In structured prediction, the goal is to predict a structured output $y \in \mathcal{Y}$ (such as a sequence, a graph, an image) given an input $x \in \mathcal{X}$. The quality of prediction is measured by a task-dependent loss function $L(\hat{y}, y \mid x) \geq 0$ specifying the cost for predicting $\hat{y}$ when the correct output is $y$. In this paper, we consider the case when the number of possible predictions and the number of possible labels are both finite. For simplicity, we also assume that the sets of possible predictions and correct outputs always coincide and do not depend on $x$. (Our analysis is generalizable to rectangular losses, e.g., ranking losses studied by Ramaswamy et al. (2013).) We refer to this set as the set of labels $\mathcal{Y}$, denote its cardinality by $k$, and map its elements to $1, \dots, k$. In this setting, assuming that the loss function depends only on $\hat{y}$ and $y$, but not on $x$ directly, the loss is defined by a loss matrix $L \in \mathbb{R}^{k \times k}$. We assume that all the elements of the matrix $L$ are non-negative and will use $L_{\max}$ to denote the maximal element. Compared to multi-class classification, $k$ is typically exponentially large in the size of the natural dimension of $y$, e.g., $\mathcal{Y}$ contains all possible sequences of symbols from a finite alphabet.

Following standard practices in structured prediction (Collins, 2002; Taskar et al., 2003), we define the prediction model by a score function $f: \mathcal{X} \to \mathbb{R}^k$ specifying a score $f_y(x)$ for each possible output $y \in \mathcal{Y}$. The final prediction is done by selecting a label with the maximal value of the score

$$\text{pred}(f(x)) := \operatorname*{argmax}_{\hat{y} \in \mathcal{Y}} f_{\hat{y}}(x), \tag{1}$$

with some fixed strategy to resolve ties. To simplify the analysis, we assume that among the labels with maximal scores, the predictor always picks the one with the smallest index.
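The prediction rule (1) with its tie-breaking convention can be sketched in a few lines. This is a minimal illustration only: the function name `pred` mirrors the notation of (1), and labels are indexed from 0 here (an assumption of this sketch), whereas the text maps them to $1, \dots, k$.

```python
from typing import Sequence


def pred(scores: Sequence[float]) -> int:
    """Return the index of a maximal score; ties go to the smallest index."""
    best = 0
    for c in range(1, len(scores)):
        # Strict '>' keeps the earliest maximizer, matching the tie-breaking rule.
        if scores[c] > scores[best]:
            best = c
    return best
```

For instance, `pred([0.1, 0.7, 0.7])` selects label 1 rather than label 2, because both reach the maximal score and the smaller index wins.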

The goal of prediction-based machine learning consists in finding a predictor that works well on the unseen test set, i.e., data points coming from the same distribution $\mathcal{D}$ as the one generating the training data. One way to formalize this is to minimize the generalization error, often referred to as the actual (or population) risk based on the loss $L$,

$$R_L(f) := \mathbb{E}_{(x,y) \sim \mathcal{D}} \, L\big(\text{pred}(f(x)), y\big). \tag{2}$$

Minimizing the actual risk (2) is usually hard. The standard approach is to minimize a surrogate risk, which is a different objective easier to optimize, e.g., convex. We define a surrogate loss as a function $\Phi: \mathbb{R}^k \times \mathcal{Y} \to \mathbb{R}$ depending on a score vector $f = f(x) \in \mathbb{R}^k$ and a target label $y \in \mathcal{Y}$ as input arguments. We denote the $y$-th component of $f$ with $f_y$. The surrogate risk (the $\Phi$-risk) is defined as

$$R_\Phi(f) := \mathbb{E}_{(x,y) \sim \mathcal{D}} \, \Phi(f(x), y), \tag{3}$$

where the expectation is taken w.r.t. the data-generating distribution $\mathcal{D}$. To make the minimization of (3) well-defined, we always assume that the surrogate loss $\Phi$ is bounded from below and continuous.

Examples of common surrogate losses include the structured hinge loss (Taskar et al., 2003; Tsochantaridis et al., 2005), the log loss (maximum likelihood learning) used, e.g., in conditional random fields (Lafferty et al., 2001), and their hybrids (Pletscher et al., 2010; Gimpel & Smith, 2010; Hazan & Urtasun, 2010; Shi et al., 2015).

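In terms of task losses, we consider the unstructured 0-1 loss $L_{01}(\hat{y}, y) := [\hat{y} \neq y]$ (here we use the Iverson bracket notation, i.e., $[A] := 1$ if a logical expression $A$ is true, and zero otherwise), and the two following structured losses: block 0-1 loss with $b$ equal blocks of labels, $L_{01,b}(\hat{y}, y) := [\hat{y} \text{ and } y \text{ are not in the same block}]$; and (normalized) Hamming loss between tuples of $T$ binary variables $y_t$: $L_{\text{Ham},T}(\hat{y}, y) := \frac{1}{T} \sum_{t=1}^{T} [\hat{y}_t \neq y_t]$. To illustrate some aspects of our analysis, we also look at the mixed loss $L_{01,b,\eta}$: a convex combination of the 0-1 and block 0-1 losses, defined as $L_{01,b,\eta} := \eta L_{01} + (1 - \eta) L_{01,b}$ for some $\eta \in [0, 1]$.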
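For small label sets, the loss matrices of these task losses can be written out explicitly. The sketch below is illustrative: all function names are hypothetical, blocks are assumed to be consecutive index ranges, the Hamming labels are enumerated in lexicographic order, and the mixed loss is assumed to put weight `eta` on the 0-1 part and `1 - eta` on the block 0-1 part.

```python
from itertools import product
from typing import List

Matrix = List[List[float]]


def zero_one_loss(k: int) -> Matrix:
    """0-1 loss: cost 1 for any wrong label, 0 on the diagonal."""
    return [[0.0 if i == j else 1.0 for j in range(k)] for i in range(k)]


def block_zero_one_loss(k: int, b: int) -> Matrix:
    """Block 0-1 loss: cost 1 only when the two labels lie in different blocks."""
    if k % b != 0:
        raise ValueError("k must be divisible by the number of blocks b")
    size = k // b  # assumption: blocks are consecutive index ranges of equal size
    return [[0.0 if i // size == j // size else 1.0 for j in range(k)]
            for i in range(k)]


def hamming_loss(T: int) -> Matrix:
    """Normalized Hamming loss over all 2**T binary tuples of length T."""
    labels = list(product((0, 1), repeat=T))
    return [[sum(a != c for a, c in zip(u, v)) / T for v in labels]
            for u in labels]


def mixed_loss(k: int, b: int, eta: float) -> Matrix:
    """Convex combination: eta * (0-1 loss) + (1 - eta) * (block 0-1 loss)."""
    l01, lb = zero_one_loss(k), block_zero_one_loss(k, b)
    return [[eta * l01[i][j] + (1.0 - eta) * lb[i][j] for j in range(k)]
            for i in range(k)]
```

Note that `hamming_loss(T)` already has $2^T$ rows, which is why explicit matrices are only practical for toy sizes; the analysis works with the structure of $L$ instead.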

## 3 Consistency for structured prediction

### 3.1 Calibration function

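We now formalize the connection between the actual risk $R_L$ and the surrogate $\Phi$-risk $R_\Phi$ via the so-called calibration function, see Definition 1 below (Bartlett et al., 2006; Zhang, 2004a; Steinwart, 2007; Duchi et al., 2010; Ávila Pires et al., 2013). As is standard for this kind of analysis, the setup is non-parametric, i.e., it does not take into account the dependency of scores on input variables $x$. For now, we assume that a family of score functions $\mathcal{F}_{\mathcal{F}}$ consists of all vector-valued Borel measurable functions $f: \mathcal{X} \to \mathcal{F}$, where $\mathcal{F} \subseteq \mathbb{R}^k$ is a subspace of allowed score vectors, which will play an important role in our analysis. This setting is equivalent to a pointwise analysis, i.e., looking at the different inputs $x$ independently. We bring the dependency on the input back into the analysis in Section 3.3 where we assume a specific family of score functions.

Let $\mathcal{D}_{\mathcal{X}}$ represent the marginal distribution for $\mathcal{D}$ on $\mathcal{X}$ and $\mathbb{P}_{\mathcal{D}}(\cdot \mid x)$ denote its conditional given $x \in \mathcal{X}$. We can now rewrite the risk $R_L$ and $\Phi$-risk $R_\Phi$ as

$$R_L(f) = \mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}} \, \ell\big(f(x), \mathbb{P}_{\mathcal{D}}(\cdot \mid x)\big), \qquad R_\Phi(f) = \mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}} \, \phi\big(f(x), \mathbb{P}_{\mathcal{D}}(\cdot \mid x)\big),$$

where the conditional risk $\ell$ and the conditional $\Phi$-risk $\phi$ depend on a vector of scores $f$ and a conditional distribution $q$ on the set of output labels as

$$\ell(f, q) := \sum_{c=1}^{k} q_c \, L(\text{pred}(f), c), \qquad \phi(f, q) := \sum_{c=1}^{k} q_c \, \Phi(f, c).$$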
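The two conditional risks are plain weighted sums and can be computed directly for a small label set. In this sketch the names `conditional_risk` and `conditional_surrogate_risk` are hypothetical, labels are 0-indexed, and the surrogate is passed in as an arbitrary callable, so nothing here is specific to any particular surrogate loss.

```python
from typing import Callable, List, Sequence


def pred(scores: Sequence[float]) -> int:
    """Argmax with ties resolved toward the smallest index."""
    best = 0
    for c in range(1, len(scores)):
        if scores[c] > scores[best]:
            best = c
    return best


def conditional_risk(f: Sequence[float], q: Sequence[float],
                     L: List[List[float]]) -> float:
    """l(f, q) = sum_c q_c * L[pred(f)][c]: expected task loss of the prediction."""
    y_hat = pred(f)
    return sum(q[c] * L[y_hat][c] for c in range(len(q)))


def conditional_surrogate_risk(
        f: Sequence[float], q: Sequence[float],
        surrogate: Callable[[Sequence[float], int], float]) -> float:
    """phi(f, q) = sum_c q_c * Phi(f, c): expected surrogate loss of the scores."""
    return sum(q[c] * surrogate(f, c) for c in range(len(q)))
```

With the 0-1 loss on three labels, scores `[0.2, 0.9, 0.1]` and `q = [0.5, 0.3, 0.2]`, the prediction is label 1 and the conditional risk is the probability mass of the other two labels.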

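The calibration function $H_{\Phi,L,\mathcal{F}}$ between the surrogate loss $\Phi$ and the task loss $L$ relates the excess surrogate risk with the actual excess risk via the excess risk bound:

$$H_{\Phi,L,\mathcal{F}}\big(\delta\ell(f, q)\big) \leq \delta\phi(f, q), \quad \forall f \in \mathcal{F}, \; \forall q \in \Delta_k, \tag{4}$$

where $\delta\phi(f, q) = \phi(f, q) - \inf_{\hat{f} \in \mathcal{F}} \phi(\hat{f}, q)$, $\delta\ell(f, q) = \ell(f, q) - \inf_{\hat{f} \in \mathcal{F}} \ell(\hat{f}, q)$ are the excess risks and $\Delta_k$ denotes the probability simplex on $k$ elements.

In other words, to find a vector $f$ that yields an excess risk smaller than $\varepsilon$, we need to optimize the $\Phi$-risk up to $H_{\Phi,L,\mathcal{F}}(\varepsilon)$ accuracy (in the worst case). We make this statement precise in Theorem 2 below, and now proceed to the formal definition of the calibration function.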

###### Definition 1 (Calibration function).

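For a task loss $L$, a surrogate loss $\Phi$, a set of feasible scores $\mathcal{F}$, the calibration function $H_{\Phi,L,\mathcal{F}}(\varepsilon)$ (defined for $\varepsilon \geq 0$) equals the infimum excess of the conditional surrogate risk when the excess of the conditional actual risk is at least $\varepsilon$:

$$H_{\Phi,L,\mathcal{F}}(\varepsilon) := \inf_{f \in \mathcal{F},\, q \in \Delta_k} \delta\phi(f, q) \tag{5}$$

$$\text{s.t.} \quad \delta\ell(f, q) \geq \varepsilon. \tag{6}$$

We set $H_{\Phi,L,\mathcal{F}}(\varepsilon)$ to $+\infty$ when the feasible set is empty.

By construction, $H_{\Phi,L,\mathcal{F}}$ is non-decreasing on $[0, +\infty)$, $H_{\Phi,L,\mathcal{F}}(\varepsilon) \geq 0$, the inequality (4) holds, and $H_{\Phi,L,\mathcal{F}}(0) = 0$. Note that $H_{\Phi,L,\mathcal{F}}$ can be non-convex and even non-continuous (see examples in Figure 1). Also, note that large values of $H_{\Phi,L,\mathcal{F}}(\varepsilon)$ are better.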

### 3.2 Notion of consistency

We use the calibration function to set a connection between optimizing the surrogate and task losses by Theorem 2, which is similar to Theorem 3 of Zhang (2004a).

###### Theorem 2 (Calibration connection).

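Let $H_{\Phi,L,\mathcal{F}}$ be the calibration function between the surrogate loss $\Phi$ and the task loss $L$ with feasible set of scores $\mathcal{F}$. Let $\check{H}_{\Phi,L,\mathcal{F}}$ be a convex non-decreasing lower bound of the calibration function. Assume that $\Phi$ is continuous and bounded from below. Then, for any $\varepsilon > 0$ with finite $\check{H}_{\Phi,L,\mathcal{F}}(\varepsilon)$ and any $f \in \mathcal{F}_{\mathcal{F}}$, we have

$$R_\Phi(f) < R^*_{\Phi,\mathcal{F}} + \check{H}_{\Phi,L,\mathcal{F}}(\varepsilon) \;\Rightarrow\; R_L(f) < R^*_{L,\mathcal{F}} + \varepsilon, \tag{7}$$

where $R^*_{\Phi,\mathcal{F}} := \inf_{f \in \mathcal{F}_{\mathcal{F}}} R_\Phi(f)$ and $R^*_{L,\mathcal{F}} := \inf_{f \in \mathcal{F}_{\mathcal{F}}} R_L(f)$.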

###### Proof.

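We take the expectation of (4) w.r.t. $x$, where the second argument of $\ell$ is set to the conditional distribution $\mathbb{P}_{\mathcal{D}}(\cdot \mid x)$. Then, we apply Jensen's inequality (since $\check{H}_{\Phi,L,\mathcal{F}}$ is convex) to get

$$\check{H}_{\Phi,L,\mathcal{F}}\big(R_L(f) - R^*_{L,\mathcal{F}}\big) \leq R_\Phi(f) - R^*_{\Phi,\mathcal{F}} < \check{H}_{\Phi,L,\mathcal{F}}(\varepsilon), \tag{8}$$

which implies (7) by monotonicity of $\check{H}_{\Phi,L,\mathcal{F}}$. ∎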

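A suitable convex non-decreasing lower bound $\check{H}_{\Phi,L,\mathcal{F}}(\varepsilon)$ required by Theorem 2 always exists, e.g., the zero constant. However, in this case Theorem 2 is not informative, because the l.h.s. of (7) is never true. Zhang (2004a, Proposition 25) claims that $\check{H}_{\Phi,L,\mathcal{F}}$ defined as the lower convex envelope of the calibration function $H_{\Phi,L,\mathcal{F}}$ satisfies $\check{H}_{\Phi,L,\mathcal{F}}(\varepsilon) > 0$, $\forall \varepsilon > 0$, if $H_{\Phi,L,\mathcal{F}}(\varepsilon) > 0$, $\forall \varepsilon > 0$, and, e.g., the set of labels is finite. This statement implies that an informative $\check{H}_{\Phi,L,\mathcal{F}}$ always exists and allows one to characterize consistency through properties of the calibration function $H_{\Phi,L,\mathcal{F}}$.

We now define a notion of level-$\eta$ consistency, which is more general than consistency.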

###### Definition 3 (level-η consistency).

A surrogate loss $\Phi$ is consistent up to level $\eta \geq 0$ w.r.t. a task loss $L$ and a set of scores $\mathcal{F}$ if and only if the calibration function satisfies $H_{\Phi,L,\mathcal{F}}(\varepsilon) > 0$ for all $\varepsilon > \eta$ and there exists $\hat{\varepsilon} > \eta$ such that $H_{\Phi,L,\mathcal{F}}(\hat{\varepsilon})$ is finite.

Looking solely at (standard level-$0$) consistency vs. inconsistency might be too coarse to capture practical properties related to optimization accuracy (see, e.g., Long & Servedio, 2013). For example, if $H_{\Phi,L,\mathcal{F}}(\varepsilon) = 0$ only for very small values of $\varepsilon$, then the method can still optimize the actual risk up to a certain level which might be good enough in practice, especially if it means that it can be optimized faster. Examples of calibration functions for consistent and inconsistent surrogate losses are shown in Figure 1.

Other notions of consistency. Definition 3 with $\eta = 0$ and $\mathcal{F} = \mathbb{R}^k$ results in the standard setting often appearing in the literature. In particular, in this case Theorem 2 implies Fisher consistency as formulated, e.g., by Pedregosa et al. (2017) for general losses and Lin (2004) for binary classification. This setting is also closely related to many definitions of consistency used in the literature. For example, for a surrogate that is continuous and bounded from below, it is equivalent to infinite-sample consistency (Zhang, 2004a), classification calibration (Tewari & Bartlett, 2007), edge-consistency (Duchi et al., 2010), $L$-calibration (Ramaswamy & Agarwal, 2016), and prediction calibration (Williamson et al., 2016). See Zhang (2004a, Appendix A) for the detailed discussion.

Role of $\mathcal{F}$. Let the approximation error for the restricted set of scores $\mathcal{F}$ be defined as $R^*_{L,\mathcal{F}} - R^*_L := \inf_{f \in \mathcal{F}_{\mathcal{F}}} R_L(f) - \inf_{f} R_L(f)$. For any conditional distribution $q$, the score vector $f := -Lq$ will yield an optimal prediction. Thus the condition $\text{span}(L) \subseteq \mathcal{F}$ is sufficient for $\mathcal{F}$ to have zero approximation error for any distribution $\mathcal{D}$, and for our $0$-consistency condition to imply the standard Fisher consistency with respect to $L$. In the following, we will see that a restricted $\mathcal{F}$ can play a role for both computational and statistical efficiency (thus losses with smaller $\text{span}(L)$ might be easier to work with).
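The claim that a suitable score vector yields an optimal prediction can be checked numerically. The sketch below takes the scores to be the negated expected losses, $f = -Lq$, so that the argmax of the scores is the argmin of the expected task loss; the helper names and the toy loss matrix are hypothetical.

```python
from typing import List, Sequence


def pred(scores: Sequence[float]) -> int:
    """Argmax with ties resolved toward the smallest index."""
    best = 0
    for c in range(1, len(scores)):
        if scores[c] > scores[best]:
            best = c
    return best


def expected_losses(L: List[List[float]], q: Sequence[float]) -> List[float]:
    """(L q)_i = sum_c L[i][c] * q_c: expected loss of predicting label i."""
    return [sum(row[c] * q[c] for c in range(len(q))) for row in L]


def bayes_scores(L: List[List[float]], q: Sequence[float]) -> List[float]:
    """Scores f = -L q, whose argmax minimizes the expected task loss."""
    return [-v for v in expected_losses(L, q)]


# Toy cost-sensitive loss: confusing the two extreme labels is expensive.
L_toy = [[0.0, 1.0, 4.0],
         [1.0, 0.0, 1.0],
         [4.0, 1.0, 0.0]]
q_toy = [0.4, 0.2, 0.4]
best_label = pred(bayes_scores(L_toy, q_toy))
```

Here `best_label` is the middle label (index 1), even though it is the least probable one under `q_toy`: with a structured loss, the optimal prediction need not be the most likely label.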

### 3.3 Connection to optimization accuracy and statistical efficiency

The scale of a calibration function is not intrinsically well-defined: we could multiply the surrogate function by a scalar and it would multiply the calibration function by the same scalar, without changing the optimization problem. Intuitively, we would like the surrogate loss to be of order $1$. If with this scale the calibration function is exponentially small (has a $1/k$ factor), then we have strong evidence that the stochastic optimization will be difficult (and thus learning will be slow).

To formalize this intuition, we add to the picture the complexity of optimizing the surrogate loss with a stochastic approximation algorithm. By using a scale-invariant convergence rate, we provide a natural normalization of the calibration function. The following two observations are central to the theoretical insights provided in our work:

1. Scale. For a properly scaled surrogate loss, the scale of the calibration function is a good indication of whether a stochastic approximation algorithm will take a large number of iterations (in the worst case) to obtain guarantees of small excess of the actual risk (and vice-versa, a large coefficient indicates a small number of iterations). The actual verification requires computing the normalization quantities given in Theorem 6 below.

2. Statistics. The bound on the number of iterations directly relates to the number of training examples that would be needed to learn, if we see each iteration of the stochastic approximation algorithm as using one training example to optimize the expected surrogate.

To analyze the statistical convergence of surrogate risk optimization, we have to specify the set of score functions that we work with. We assume that the structure on input $x \in \mathcal{X}$ is defined by a positive definite kernel $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. We denote the corresponding reproducing kernel Hilbert space (RKHS) by $\mathcal{H}$ and its explicit feature map by $\psi(x) \in \mathcal{H}$. By the reproducing property, we have $\langle f, \psi(x) \rangle_{\mathcal{H}} = f(x)$ for all $x \in \mathcal{X}$, $f \in \mathcal{H}$, where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ is the inner product in the RKHS. We define the subspace of allowed scores $\mathcal{F} \subseteq \mathbb{R}^k$ via the span of the columns of a matrix $F \in \mathbb{R}^{k \times r}$. The matrix $F$ explicitly defines the structure of the score function. With this notation, we will assume that the score function is of the form $f(x) = F W \psi(x)^{\mathsf{T}}$, where $W: \mathcal{H} \to \mathbb{R}^r$ is a linear operator to be learned (a matrix if $\mathcal{H}$ is of finite dimension) that represents a collection of $r$ elements in $\mathcal{H}$, transforming $\psi(x)$ to a vector in $\mathbb{R}^r$ by applying the RKHS inner product $r$ times. (Note that if $\text{rank}(F) = r$, our setup is equivalent to assuming a joint kernel (Tsochantaridis et al., 2005) in the product form: $K_{\text{joint}}((x, c), (x', c')) := K(x, x') F(c, :) F(c', :)^{\mathsf{T}}$, where $F(c, :)$ is the row $c$ for matrix $F$.) Note that for structured losses, we usually have $r \ll k$. The set of all score functions is thus obtained by varying $W$ in this definition and is denoted by $\mathcal{F}_{\mathcal{F},\mathcal{H}}$. As a concrete example of a score family $\mathcal{F}_{\mathcal{F},\mathcal{H}}$ for structured prediction, consider the standard sequence model with unary and pairwise potentials. In this case, the dimension $r$ equals $Ts + (T - 1)s^2$, where $T$ is the sequence length and $s$ is the number of labels of each variable. The columns of the matrix $F$ consist of $2T - 1$ groups (one for each unary and pairwise potential). Each row of $F$ has exactly one entry equal to one in each column group (with zeros elsewhere).
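The sequence-model example can be made concrete by building the score matrix for a tiny chain. The sketch below follows the description above (one column group per unary potential and one per pairwise potential, with a single one per group in every row); the function name, the argument names `T` and `s`, and the lexicographic ordering of label sequences are assumptions of this illustration.

```python
from itertools import product
from typing import List


def chain_score_matrix(T: int, s: int) -> List[List[int]]:
    """Score matrix for a chain of length T with s states per variable.

    Rows index the s**T label sequences. Columns are split into T unary
    groups of size s followed by T - 1 pairwise groups of size s * s.
    """
    r = T * s + (T - 1) * s * s
    rows = []
    for y in product(range(s), repeat=T):
        row = [0] * r
        for t in range(T):
            # Unary group t: mark the state taken by variable t.
            row[t * s + y[t]] = 1
        offset = T * s
        for t in range(T - 1):
            # Pairwise group t: mark the joint state of variables t and t + 1.
            row[offset + t * s * s + y[t] * s + y[t + 1]] = 1
        rows.append(row)
    return rows


F_chain = chain_score_matrix(T=3, s=2)
```

For `T=3, s=2` the matrix has 8 rows but only 14 columns, and every row sums to `2 * T - 1 = 5`. The number of rows grows as `s**T` while the number of columns grows linearly in `T`, which is the sense in which the score subspace is much smaller than the label set.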

In this setting, we use the online projected averaged stochastic subgradient descent ASGD (stochastic w.r.t. data $(x^{(n)}, y^{(n)}) \sim \mathcal{D}$) to minimize the surrogate risk directly (Bousquet & Bottou, 2008); see, e.g., Orabona (2014) for the formal setup of kernel ASGD. The $n$-th update consists in

$$W^{(n)} := P_D\big[ W^{(n-1)} - \gamma^{(n)} F^{\mathsf{T}} \nabla\Phi \, \psi(x^{(n)})^{\mathsf{T}} \big], \tag{9}$$

where $F^{\mathsf{T}} \nabla\Phi \, \psi(x^{(n)})^{\mathsf{T}}$ is the stochastic functional gradient, $\gamma^{(n)}$ is the step size and $P_D$ is the projection on the ball of radius $D$ w.r.t. the Hilbert–Schmidt norm. (The Hilbert–Schmidt norm of a linear operator $A$ is defined as $\|A\|_{HS} = \sqrt{\operatorname{tr}(A^{\ddagger} A)}$ where $A^{\ddagger}$ is the adjoint operator. In the case of finite dimension, the Hilbert–Schmidt norm coincides with the Frobenius matrix norm.) The vector $\nabla\Phi \in \mathbb{R}^k$ is a regular gradient of the sampled surrogate $\Phi(f(x^{(n)}), y^{(n)})$ w.r.t. the scores, $\nabla\Phi = \nabla_f \Phi(f, y^{(n)}) \big|_{f = f(x^{(n)})}$. We wrote the above update using an explicit feature map $\psi$ for notational simplicity, but kernel ASGD can also be implemented without it by using the kernel trick. The convergence properties of ASGD in RKHS are analogous to the finite-dimensional ASGD because they rely on dimension-free quantities. To use a simple convergence analysis, we follow Ciliberto et al. (2016) and make the following simplifying assumption:

###### Assumption 4 (Well-specified optimization w.r.t. the function class $\mathcal{F}_{\mathcal{F},\mathcal{H}}$).

The distribution $\mathcal{D}$ is such that $R_\Phi$ has some global minimum $f^*$ that also belongs to $\mathcal{F}_{\mathcal{F},\mathcal{H}}$.

Assumption 4 simply means that each row of $W^*$ defining $f^*$ belongs to the RKHS $\mathcal{H}$, implying a finite norm $\|W^*\|_{HS}$. Assumption 4 can be relaxed if the kernel $K$ is universal, but then the convergence analysis becomes much more complicated (Orabona, 2014).

###### Theorem 5 (Convergence rate).

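Under Assumption 4 and assuming that (i) the functions $\Phi(F\theta, y)$ are bounded from below and convex w.r.t. $\theta \in \mathbb{R}^r$ for all $y \in \mathcal{Y}$; (ii) the expected square of the norm of the stochastic gradient is bounded, $\mathbb{E}_{(x,y) \sim \mathcal{D}} \|F^{\mathsf{T}} \nabla\Phi \, \psi(x)^{\mathsf{T}}\|^2_{HS} \leq M^2$; and (iii) $\|W^*\|_{HS} \leq D$, then running the ASGD algorithm (9) with the constant step-size $\gamma := \frac{2D}{M\sqrt{N}}$ for $N$ steps admits the following expected suboptimality for the averaged iterate $\bar{f}^{(N)}$:

$$\mathbb{E}\big[R_\Phi(\bar{f}^{(N)})\big] - R^*_{\Phi,\mathcal{F}} \leq \frac{2DM}{\sqrt{N}} \quad \text{where} \quad \bar{f}^{(N)} := \frac{1}{N} \sum_{n=1}^{N} F W^{(n)} \psi(x^{(n)})^{\mathsf{T}}. \tag{10}$$

Theorem 5 is a straightforward extension of classical results (Nemirovski et al., 2009; Orabona, 2014).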
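To make the update (9) and the averaging in (10) concrete, here is a minimal finite-dimensional sketch. It assumes an explicit feature vector in place of the RKHS feature map, represents $W$ as an $r \times d$ list of lists, and uses the constant step size $2D/(M\sqrt{N})$ as an assumption of this sketch. The function names and the `grad_phi` callable (gradient of the surrogate w.r.t. the scores) are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def scores(F: Matrix, W: Matrix, psi: Sequence[float]) -> Vector:
    """Scores f = F W psi for a finite-dimensional feature vector psi."""
    w_psi = [sum(w * p for w, p in zip(row, psi)) for row in W]
    return [sum(a * b for a, b in zip(f_row, w_psi)) for f_row in F]


def project(W: Matrix, radius: float) -> Matrix:
    """Project W onto the Frobenius-norm ball of the given radius."""
    norm = math.sqrt(sum(w * w for row in W for w in row))
    if norm <= radius:
        return W
    return [[w * radius / norm for w in row] for row in W]


def asgd(samples: Sequence[Tuple[Sequence[float], int]], F: Matrix,
         grad_phi: Callable[[Vector, int], Vector],
         D: float, M: float) -> Matrix:
    """One pass of projected SGD over the samples; returns the averaged W."""
    N = len(samples)
    r, d = len(F[0]), len(samples[0][0])
    gamma = 2.0 * D / (M * math.sqrt(N))  # constant step size (assumption)
    W = [[0.0] * d for _ in range(r)]
    W_avg = [[0.0] * d for _ in range(r)]
    for psi, y in samples:
        g = grad_phi(scores(F, W, psi), y)  # gradient w.r.t. the k scores
        # F^T g maps the score-space gradient to the r-dimensional parameters.
        Ft_g = [sum(F[c][j] * g[c] for c in range(len(F))) for j in range(r)]
        step = [[W[j][i] - gamma * Ft_g[j] * psi[i] for i in range(d)]
                for j in range(r)]
        W = project(step, D)
        # Running average of the iterates W^(1), ..., W^(N).
        W_avg = [[W_avg[j][i] + W[j][i] / N for i in range(d)]
                 for j in range(r)]
    return W_avg
```

Each sample is used exactly once, which is why the iteration count of the algorithm can also be read as a number of training examples.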

By combining the convergence rate of Theorem 5 with Theorem 2 that connects the surrogate and actual risks, we get Theorem 6 which explicitly gives the number of iterations required to achieve $\varepsilon$ accuracy on the expected population risk (see App. A for the proof). Note that since ASGD is applied in an online fashion, Theorem 6 also serves as the sample complexity bound, i.e., says how many samples are needed to achieve target accuracy (compared to the best prediction rule if $\mathcal{F}$ has zero approximation error).
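The iteration count follows from requiring the rate $2DM/\sqrt{N}$ to fall below the value of the convex lower bound of the calibration function at $\varepsilon$, which rearranges to $N > 4D^2M^2 / \check{H}^2(\varepsilon)$. A small helper makes the scaling visible; the function name is hypothetical and `h_eps` stands for the value $\check{H}_{\Phi,L,\mathcal{F}}(\varepsilon)$, which has to be supplied by the analysis of the specific surrogate.

```python
import math


def required_iterations(D: float, M: float, h_eps: float) -> int:
    """Smallest integer N with N > 4 * D**2 * M**2 / h_eps**2."""
    if h_eps <= 0.0:
        raise ValueError("the calibration lower bound must be positive at eps")
    n_star = 4.0 * D * D * M * M / (h_eps * h_eps)
    return math.floor(n_star) + 1
```

Halving `h_eps` roughly quadruples the result, so a calibration function that is exponentially small in the size of the output translates into an exponentially large number of iterations.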

###### Theorem 6 (Learning complexity).

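Under the assumptions of Theorem 5, for any $\varepsilon > 0$, the random (w.r.t. the observed training set) output $\bar{f}^{(N)} \in \mathcal{F}_{\mathcal{F},\mathcal{H}}$ of the ASGD algorithm after

$$N > N^* := \frac{4 D^2 M^2}{\check{H}^2_{\Phi,L,\mathcal{F}}(\varepsilon)} \tag{11}$$

iterations has the expected excess risk bounded with $\varepsilon$, i.e.,

$$\mathbb{E}\big[R_L(\bar{f}^{(N)})\big] < R^*_{L,\mathcal{F}} + \varepsilon.$$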

## 4 Calibration function analysis for quadratic surrogate

A major challenge to applying Theorem 6 is the computation of the calibration function $H_{\Phi,L,\mathcal{F}}$. In App. C, we present a generalization to arbitrary multi-class losses of a surrogate loss class from Zhang (2004a, Section 4.4.2) that is consistent for any task loss $L$. Here, we consider the simplest example of this family, called the quadratic surrogate $\Phi_{\text{quad}}$, which has the advantage that we can bound or even compute exactly its calibration function. We define the quadratic surrogate as

One simple sufficient condition for the surrogate (12) to be consistent and also to have zero approximation error is that $\mathcal{F}$ fully contains $\text{span}(L)$. To make the dependence on the score subspace explicit, we parameterize it with a matrix $F \in \mathbb{R}^{k \times r}$ with the number of columns $r$ typically being much smaller than the number of labels $k$. With this notation, we have $\mathcal{F} = \text{span}(F)$, and the dimensionality of $\mathcal{F}$ equals the rank of $F$, which is at most $r$. (Evaluating $\Phi_{\text{quad}}$ requires computing $F^{\mathsf{T}} F$ and $F^{\mathsf{T}} L(:, y)$, for which direct computation is intractable when $k$ is exponential, but which can be done in closed form for the structured losses we consider, the Hamming and block 0-1 loss. More generally, these operations require suitable inference algorithms. See also App. F.)

For the quadratic surrogate (12), the excess of the expected surrogate takes a simple form:

Equation (13) holds under the assumption that the subspace $\mathcal{F}$ contains the column space of the loss matrix, $\text{span}(L)$, which also means that the set $\mathcal{F}$ contains the optimal prediction for any $q$ (see Lemma 9 in App. B for the proof). Importantly, the function $\delta\phi_{\text{quad}}(F\theta, q)$ is jointly convex in the conditional probability $q$ and parameters $\theta$, which simplifies its analysis.
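The simple form of the excess surrogate risk can be checked by completing the square. The derivation below is a sketch under a stated assumption: the quadratic surrogate (12) is taken to be $\Phi_{\text{quad}}(f, y) = \frac{1}{2k} \|f + L(:, y)\|_2^2$, where $L(:, y)$ denotes the $y$-th column of the loss matrix. The normalization $\frac{1}{2k}$ is part of that assumption and only rescales the result.

```latex
\begin{align*}
\phi_{\text{quad}}(f, q)
  &= \sum_{c=1}^{k} q_c \, \frac{1}{2k} \big\| f + L(:, c) \big\|_2^2 \\
  &= \frac{1}{2k} \Big( \|f\|_2^2 + 2 f^{\mathsf{T}} L q
       + \sum_{c=1}^{k} q_c \|L(:, c)\|_2^2 \Big)
     && \text{using } \textstyle\sum_c q_c = 1,\ \sum_c q_c L(:, c) = L q \\
  &= \frac{1}{2k} \| f + L q \|_2^2
     + \frac{1}{2k} \Big( \sum_{c=1}^{k} q_c \|L(:, c)\|_2^2 - \|L q\|_2^2 \Big).
\end{align*}
% The second term does not depend on f. If span(L) is contained in the score
% subspace, the minimizer f^* = -L q is feasible and makes the first term zero:
\begin{equation*}
\delta\phi_{\text{quad}}(F\theta, q)
  = \phi_{\text{quad}}(F\theta, q) - \inf_{\hat{f} \in \mathcal{F}} \phi_{\text{quad}}(\hat{f}, q)
  = \frac{1}{2k} \| F\theta + L q \|_2^2 .
\end{equation*}
```

The result is a squared norm of a map that is linear in $(\theta, q)$, which is one way to see the joint convexity stated above. The minimizer $-Lq$ is also the score vector that yields an optimal prediction for the task loss.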

Lower bound on the calibration function. We now present our main technical result: a lower bound on the calibration function for the surrogate loss  (12). This lower bound characterizes the easiness of learning with this surrogate given the scaling intuition mentioned in Section 3.3. The proof of Theorem 7 is given in App. D.1.

###### Theorem 7 (Lower bound on HΦ{quad}).

For any task loss , its quadratic surrogate , and a score subspace  containing the column space of , the calibration function can be lower bounded:

where is the orthogonal projection on the subspace  and with being the -th basis vector of the standard basis in .

Lower bound for specific losses. We now discuss the meaning of the bound (14) for some specific losses (the detailed derivations are given in App. D.3). For the 0-1, block 0-1 and Hamming losses (, and , respectively) with the smallest possible score subspaces , the bound (14) gives  , and , respectively. All these bounds are tight (see App. E). However, if  the bound (14) is not tight for the block 0-1 and mixed losses (see also App. E). In particular, the bound (14) cannot detect level- consistency for (see Def. 3) and does not change when the loss changes, but the score subspace stays the same.

Upper bound on the calibration function. Theorem 8 below gives an upper bound on the calibration function holding for unconstrained scores, i.e., $\mathcal{F} = \mathbb{R}^k$ (see the proof in App. D.2). This result shows that without some appropriate constraints on the scores, efficient learning is not guaranteed (in the worst case) because of the $1/k$ scaling of the calibration function.

###### Theorem 8 (Upper bound on HΦ{quad}).

If a loss matrix  with defines a pseudometric777A pseudometric is a function satisfying the following axioms: , (but possibly for some ), , . on labels and there are no constraints on the scores, i.e., , then the calibration function for the quadratic surrogate  can be upper bounded:

From our lower bound in Theorem 7 (which guarantees consistency), the natural constraint on the score is , with the dimension of this space giving an indication of the intrinsic “difficulty” of a loss. Computations for the lower bounds in some specific cases (see App. D.3 for details) show that the 0-1 loss is “hard” while the block 0-1 loss and the Hamming loss are “easy”. Note that in all these cases the lower bound (14) is tight, see the discussion below.

Exact calibration functions. Note that the bounds proven in Theorems 7 and 8 imply that, in the case of no constraints on the scores , for the 0-1, block 0-1 and Hamming losses, we have

where is the matrix defining a loss. For completeness, in App. E, we compute the exact calibration functions for the 0-1 and block 0-1 losses. Note that the calibration function for the 0-1 loss equals the lower bound, illustrating the worst-case scenario. To get some intuition, an example of a conditional distribution  that gives the (worst case) value to the calibration function (for several losses) is , and for . See the proof of Proposition 12 in App. E.1.

In what follows, we provide the calibration functions in the cases with constraints on the scores. For the block 0-1 loss with  equal blocks and under constraints that the scores within blocks are equal, the calibration function equals (see Proposition 14 of App. E.2)

For the Hamming loss defined over  binary variables and under constraints implying separable scores, the calibration function equals (see Proposition 15 in App. E.3)

The calibration functions (16) and (17) depend on the quantities representing the actual complexities of the loss (the number of blocks  and the length of the sequence ) and can be exponentially larger than the upper bound for the unconstrained case.

In the case of mixed 0-1 and block 0-1 loss, if the scores  are constrained to be equal inside the blocks, i.e., belong to the subspace , then the calibration function is equal to  for , implying inconsistency (and also note that the approximation error can be as big as  for ). However, for , the calibration function is of the order See Figure 1b for the illustration of this calibration function and Proposition 17 of App. E.4 for the exact formulation and the proof. Note that while the calibration function for the constrained case is inconsistent, its value can be exponentially larger than the one for the unconstrained case for big enough and when the blocks are exponentially large (see Proposition 16 of App. E.4).

Computation of the SGD constants. Applying the learning complexity Theorem 6 requires computing the quantity $DM$, where $D$ bounds the norm of the optimal solution and $M^2$ bounds the expected square of the norm of the stochastic gradient. In App. F, we provide a way to bound this quantity for our quadratic surrogate (12) under the simplifying assumption that each conditional $q_c(x)$ (seen as function of $x$) belongs to the RKHS $\mathcal{H}$ (which implies Assumption 4). In particular, we get

$$DM = L_{\max}^2 \, \xi\big(\kappa(F) \sqrt{r} \, R \, Q_{\max}\big), \qquad \xi(z) = z^2 + z, \tag{18}$$

where $\kappa(F)$ is the condition number of the matrix $F$, and $R$ is an upper bound on the RKHS norm of object feature maps $\|\psi(x)\|_{\mathcal{H}}$. We define $Q_{\max}$ as an upper bound on $\sum_{c=1}^{k} \|q_c\|_{\mathcal{H}}$ (can be seen as the generalization of the inequality $\sum_{c=1}^{k} q_c \leq 1$ for probabilities). The constants $R$ and $Q_{\max}$ depend on the data, the constant $L_{\max}$ depends on the loss, and $r$ and $\kappa(F)$ depend on the choice of matrix $F$.

We compute the constant for the specific losses that we considered in App. F.1. For the 0-1, block 0-1 and Hamming losses, we have , and , respectively. These computations indicate that the quadratic surrogate allows efficient learning for structured block 0-1 and Hamming losses, but that the convergence could be slow in the worst case for the 0-1 loss.

## 5 Related works

Consistency for multi-class problems. Building on significant progress for the case of binary classification, see, e.g. Bartlett et al. (2006), there has been a lot of interest in the multi-class case. Zhang (2004a) and Tewari & Bartlett (2007) analyze the consistency of many existing surrogates for the 0-1 loss. Gao & Zhou (2011) focus on multi-label classification. Narasimhan et al. (2015) provide a consistent algorithm for arbitrary multi-class loss defined by a function of the confusion matrix. Recently, Ramaswamy & Agarwal (2016) introduce the notion of convex calibrated dimension, as the minimal dimensionality of the score vector that is required for consistency. In particular, they showed that for the Hamming loss on  binary variables, this dimension is at most . In our analysis, we use scores of rank , see (35) in App. D.3, yielding a similar result.

The task of ranking has attracted a lot of attention and Duchi et al. (2010); Buffoni et al. (2011); Calauzènes et al. (2012); Ramaswamy et al. (2013) analyze different families of surrogate and task losses proving their (in-)consistency. In this line of work, Ramaswamy et al. (2013) propose a quadratic surrogate for an arbitrary low rank loss which is related to our quadratic surrogate (12). They also prove that several important ranking losses, i.e., precision@q, expected rank utility, mean average precision and pairwise disagreement, are of low-rank. We conjecture that our approach is compatible with these losses and leave precise connections as future work.

Structured SVM (SSVM) and friends. SSVM (Taskar et al., 2003, 2005; Tsochantaridis et al., 2005) is one of the most used convex surrogates for tasks with structured outputs, thus, its consistency has been a question of great interest. It is known that Crammer-Singer multi-class SVM (Crammer & Singer, 2001), which SSVM is built on, is not consistent for 0-1 loss unless there is a majority class with probability at least  (Zhang, 2004a; McAllester, 2007). However, it is consistent for the “abstain” and ordinal losses in the case of  classes Ramaswamy & Agarwal (2016). Structured ramp loss and probit surrogates are closely related to SSVM and are consistent (McAllester, 2007; Do et al., 2009; McAllester & Keshet, 2011; Keshet, 2014), but not convex.

Recently, Doğan et al. (2016) categorized different versions of multi-class SVM and analyzed them from the points of view of Fisher and universal consistency. In particular, they highlight differences between Fisher and universal consistency and give examples of surrogates that are Fisher consistent but not universally consistent, and vice versa. They also highlight that the Crammer-Singer SVM is neither Fisher nor universally consistent, even with a careful choice of regularizer.

Quadratic surrogates for structured prediction. Ciliberto et al. (2016) and Brouard et al. (2016) consider minimizing  aiming to match the RKHS embedding of inputs  to the feature maps of outputs . In their frameworks, the task loss is not considered at the learning stage, but only at the prediction stage. Our quadratic surrogate (12) depends on the loss directly. The empirical risk defined by both their and our objectives can be minimized analytically with the help of the kernel trick and, moreover, the resulting predictors are identical. However, performing such a computation in the case of a large dataset can be intractable, and the generalization properties have to be taken care of, e.g., by means of regularization. In the large-scale scenario, it is more natural to apply stochastic optimization (e.g., kernel ASGD) that directly minimizes the population risk and has a better dependency on the dataset size. When combined with stochastic optimization, the two approaches lead to different behavior. In our framework, we need to estimate  scalar functions, but the alternative needs to estimate  functions (if, e.g., ), which results in significant differences for low-rank losses, such as block 0-1 and Hamming.

Calibration functions. Bartlett et al. (2006) and Steinwart (2007) provide calibration functions for most existing surrogates for binary classification. All these functions differ in terms of shape, but are roughly similar in terms of constants. Pedregosa et al. (2017) generalize these results to the case of ordinal regression. However, their calibration functions have at best a factor if the surrogate is normalized w.r.t. the number of classes. The task of ranking has also been of significant interest. However, most of the literature (e.g., Clémençon et al., 2008; Cossock & Zhang, 2008; Kotlowski et al., 2011; Agarwal, 2014) focuses only on calibration functions (in the form of regret bounds) for bipartite ranking, which is more akin to cost-sensitive binary classification.

Ávila Pires et al. (2013) generalize the theoretical framework developed by Steinwart (2007) and present results for the multi-class SVM of Lee et al. (2004) (the score vectors are constrained to sum to zero) that can be built for any task loss of interest. Their surrogate is of the form where and is some convex function with all subgradients at zero being positive. The recent work by Ávila Pires & Szepesvári (2016) refines the results, but specifically for the case of 0-1 loss. In this line of work, the surrogate is typically not normalized by , and if normalized the calibration functions have the constant appearing.

Finally, Ciliberto et al. (2016) provide the calibration function for their quadratic surrogate. Assuming that the loss can be represented as , (this assumption can always be satisfied in the case of a finite number of labels, by taking as the loss matrix  and where is the -th vector of the standard basis in ). In their Theorem 2, they provide an excess risk bound leading to a lower bound on the corresponding calibration function where a constant simply equals the spectral norm of the loss matrix for the finite-dimensional construction provided above. However, the spectral norm of the loss matrix is exponentially large even for highly structured losses such as the block 0-1 and Hamming losses, i.e., , . This conclusion puts the objective of Ciliberto et al. (2016) in line with ours when no constraints are put on the scores.
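The growth of the spectral norm can be observed numerically on small instances. The following sketch is ours and makes two assumptions: the Hamming loss is normalized by the number $T$ of binary variables, and the block 0-1 loss uses $b$ blocks of equal size $s$ (the function names are hypothetical). Both matrices are symmetric with nonnegative entries and constant row sums, so their largest absolute eigenvalue equals the row sum, i.e., $k/2$ for the Hamming loss and $k - k/b$ for the block 0-1 loss with $k$ labels. Power iteration recovers these values.

```python
from itertools import product
from math import sqrt


def hamming_loss_matrix(T):
    """Normalized Hamming loss between all pairs of T-bit labelings (k = 2**T labels)."""
    labels = list(product([0, 1], repeat=T))
    return [[sum(a != b for a, b in zip(y_hat, y)) / T for y in labels]
            for y_hat in labels]


def block_zero_one_loss_matrix(b, s):
    """Block 0-1 loss with b blocks of s labels each: 0 inside a block, 1 across blocks."""
    k = b * s
    return [[0.0 if i // s == j // s else 1.0 for j in range(k)] for i in range(k)]


def norm(x):
    return sqrt(sum(v * v for v in x))


def spectral_norm(matrix, iterations=200):
    """Power iteration for a symmetric matrix: ||M v|| / ||v|| tends to the largest |eigenvalue|."""
    n = len(matrix)
    v = [1.0 + i / n for i in range(n)]  # generic positive start vector
    estimate = 0.0
    for _ in range(iterations):
        w = [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in matrix]
        w_norm = norm(w)
        estimate = w_norm / norm(v)
        v = [x / w_norm for x in w]
    return estimate


for T in (2, 3, 4, 5):
    # Number of labels k = 2**T next to the estimated spectral norm.
    print(2 ** T, spectral_norm(hamming_loss_matrix(T)))
print(spectral_norm(block_zero_one_loss_matrix(3, 2)))
```

Since $k = 2^T$ for the Hamming loss, a norm proportional to $k$ is exponential in the number of variables.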

## 6 Conclusion

In this paper, we studied the consistency of convex surrogate losses specifically in the context of structured prediction. We analyzed calibration functions and proposed an optimization-based normalization aiming to connect consistency with the existence of efficient learning algorithms. Finally, we instantiated all components of our framework for several losses by computing the calibration functions and the constants coming from the normalization. By carefully monitoring exponential constants, we highlighted the difference between tractable and intractable task losses.

These were first steps in advancing our theoretical understanding of consistent structured prediction. Further steps include analyzing more losses, such as the low-rank ranking losses studied by Ramaswamy et al. (2013). Moreover, instead of considering constraints on the scores, one could put constraints on the set of distributions to investigate the effect on the calibration function.

#### Acknowledgements

We would like to thank Pascal Germain for useful discussions. This work was partly supported by the ERC grant Activia (no. 307574), the NSERC Discovery Grant RGPIN-2017-06936 and the MSR-INRIA Joint Center.

## References

• Agarwal (2014) Agarwal, Shivani. Surrogate regret bounds for bipartite ranking via strongly proper losses. Journal of Machine Learning Research (JMLR), 15(1):1653–1674, 2014.
• Ávila Pires & Szepesvári (2016) Ávila Pires, Bernardo and Szepesvári, Csaba. Multiclass classification calibration functions. arXiv, 1609.06385v1, 2016.
• Ávila Pires et al. (2013) Ávila Pires, Bernardo, Ghavamzadeh, Mohammad, and Szepesvári, Csaba. Cost-sensitive multiclass classification risk bounds. In ICML, 2013.
• Bakir et al. (2007) Bakir, Gökhan, Hofmann, Thomas, Schölkopf, Bernhard, Smola, Alexander J., Taskar, Ben, and Vishwanathan, S.V.N. Predicting Structured Data. MIT Press, 2007.
• Bartlett et al. (2006) Bartlett, Peter L., Jordan, Michael I., and McAuliffe, Jon D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
• Bousquet & Bottou (2008) Bousquet, Olivier and Bottou, Léon. The tradeoffs of large scale learning. In NIPS, 2008.
• Brouard et al. (2016) Brouard, Céline, Szafranski, Marie, and d’Alché-Buc, Florence. Input output kernel regression: Supervised and semi-supervised structured output prediction with operator-valued kernels. Journal of Machine Learning Research (JMLR), 17(176):1–48, 2016.
• Buffoni et al. (2011) Buffoni, David, Gallinari, Patrick, Usunier, Nicolas, and Calauzènes, Clément. Learning scoring functions with order-preserving losses and standardized supervision. In ICML, 2011.
• Calauzènes et al. (2012) Calauzènes, Clément, Usunier, Nicolas, and Gallinari, Patrick. On the (non-)existence of convex, calibrated surrogate losses for ranking. In NIPS, 2012.
• Ciliberto et al. (2016) Ciliberto, Carlo, Rosasco, Lorenzo, and Rudi, Alessandro. A consistent regularization approach for structured prediction. In NIPS, 2016.
• Clémençon et al. (2008) Clémençon, Stéphan, Lugosi, Gábor, and Vayatis, Nicolas. Ranking and empirical minimization of U-statistics. The Annals of Statistics, pp. 844–874, 2008.
• Collins (2002) Collins, Michael. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP, 2002.
• Cortes et al. (2016) Cortes, Corinna, Kuznetsov, Vitaly, Mohri, Mehryar, and Yang, Scott. Structured prediction theory based on factor graph complexity. In NIPS, 2016.
• Cossock & Zhang (2008) Cossock, David and Zhang, Tong. Statistical analysis of Bayes optimal subset ranking. IEEE Transactions on Information Theory, 54(11):5140–5154, 2008.
• Crammer & Singer (2001) Crammer, Koby and Singer, Yoram. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research (JMLR), 2:265–292, 2001.
• Do et al. (2009) Do, Chuong B., Le, Quoc, Teo, Choon Hui, Chapelle, Olivier, and Smola, Alex. Tighter bounds for structured estimation. In NIPS, 2009.
• Doğan et al. (2016) Doğan, Ürün, Glasmachers, Tobias, and Igel, Christian. A unified view on multi-class support vector classification. Journal of Machine Learning Research (JMLR), 17(45):1–32, 2016.
• Duchi et al. (2010) Duchi, John C., Mackey, Lester W., and Jordan, Michael I. On the consistency of ranking algorithms. In ICML, 2010.
• Durbin et al. (1998) Durbin, Richard, Eddy, Sean, Krogh, Anders, and Mitchison, Graeme. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, 1998.
• Gao & Zhou (2011) Gao, Wei and Zhou, Zhi-Hua. On the consistency of multi-label learning. In COLT, 2011.
• Gimpel & Smith (2010) Gimpel, Kevin and Smith, Noah A. Softmax-margin CRFs: Training loglinear models with cost functions. In NAACL, 2010.
• Hazan & Urtasun (2010) Hazan, Tamir and Urtasun, Raquel. A primal-dual message-passing algorithm for approximated large scale structured prediction. In NIPS, 2010.
• Keshet (2014) Keshet, Joseph. Optimizing the measure of performance in structured prediction. In Advanced Structured Prediction. MIT Press, 2014.
• Kotlowski et al. (2011) Kotlowski, Wojciech, Dembczynski, Krzysztof, and Huellermeier, Eyke. Bipartite ranking through minimization of univariate loss. In ICML, 2011.
• Lafferty et al. (2001) Lafferty, John, McCallum, Andrew, and Pereira, Fernando. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
• Lee et al. (2004) Lee, Yoonkyung, Lin, Yi, and Wahba, Grace. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99(465):67–81, 2004.
• Lin (2004) Lin, Yi. A note on margin-based loss functions in classification. Statistics & Probability Letters, 68(1):73–82, 2004.
• London et al. (2016) London, Ben, Huang, Bert, and Getoor, Lise. Stability and generalization in structured prediction. Journal of Machine Learning Research (JMLR), 17(222):1–52, 2016.
• Long & Servedio (2013) Long, Phil and Servedio, Rocco. Consistency versus realizable H-consistency for multiclass classification. In ICML, 2013.
• McAllester & Keshet (2011) McAllester, D. A. and Keshet, J. Generalization bounds and consistency for latent structural probit and ramp loss. In NIPS, 2011.
• McAllester (2007) McAllester, David. Generalization bounds and consistency for structured labeling. In Predicting Structured Data. MIT Press, 2007.
• Narasimhan et al. (2015) Narasimhan, Harikrishna, Ramaswamy, Harish G., Saha, Aadirupa, and Agarwal, Shivani. Consistent multiclass algorithms for complex performance measures. In ICML, 2015.
• Nemirovski et al. (2009) Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
• Nowozin & Lampert (2011) Nowozin, Sebastian and Lampert, Christoph H. Structured learning and prediction in computer vision. Foundations and Trends in Computer Graphics and Vision, 6(3–4):185–365, 2011.
• Nowozin et al. (2014) Nowozin, Sebastian, Gehler, Peter V., Jancsary, Jeremy, and Lampert, Christoph H. Advanced Structured Prediction. MIT Press, 2014.
• Orabona (2014) Orabona, Francesco. Simultaneous model selection and optimization through parameter-free stochastic learning. In NIPS, 2014.
• Pedregosa et al. (2017) Pedregosa, Fabian, Bach, Francis, and Gramfort, Alexandre. On the consistency of ordinal regression methods. Journal of Machine Learning Research (JMLR), 18(55):1–35, 2017.
• Pletscher et al. (2010) Pletscher, Patrick, Ong, Cheng Soon, and Buhmann, Joachim M. Entropy and margin maximization for structured output learning. In ECML PKDD, 2010.
• Ramaswamy & Agarwal (2016) Ramaswamy, Harish G. and Agarwal, Shivani. Convex calibration dimension for multiclass loss matrices. Journal of Machine Learning Research (JMLR), 17(14):1–45, 2016.
• Ramaswamy et al. (2013) Ramaswamy, Harish G., Agarwal, Shivani, and Tewari, Ambuj. Convex calibrated surrogates for low-rank loss matrices with applications to subset ranking losses. In NIPS, 2013.
• Shi et al. (2015) Shi, Qinfeng, Reid, Mark, Caetano, Tiberio, van den Hengel, Anton, and Wang, Zhenhua. A hybrid loss for multiclass and structured prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 37(1):2–12, 2015.
• Smith (2011) Smith, Noah A. Linguistic structure prediction. Synthesis Lectures on Human Language Technologies, 4(2):1–274, 2011.
• Steinwart (2007) Steinwart, Ingo. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225–287, 2007.
• Taskar et al. (2003) Taskar, Ben, Guestrin, Carlos, and Koller, Daphne. Max-margin Markov networks. In NIPS, 2003.
• Taskar et al. (2005) Taskar, Ben, Chatalbashev, Vassil, Koller, Daphne, and Guestrin, Carlos. Learning structured prediction models: a large margin approach. In ICML, 2005.
• Tewari & Bartlett (2007) Tewari, Ambuj and Bartlett, Peter L. On the consistency of multiclass classification methods. Journal of Machine Learning Research (JMLR), 8:1007–1025, 2007.
• Tsochantaridis et al. (2005) Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research (JMLR), 6:1453–1484, 2005.
• Williamson et al. (2016) Williamson, Robert C., Vernet, Elodie, and Reid, Mark D. Composite multiclass losses. Journal of Machine Learning Research (JMLR), 17(223):1–52, 2016.
• Zhang (2004a) Zhang, Tong. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research (JMLR), 5:1225–1251, 2004a.
• Zhang (2004b) Zhang, Tong. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1):56–134, 2004b.

Supplementary Material (Appendix)

On Structured Prediction Theory with Calibrated Convex Surrogate Losses

## Outline

• Section A: Proof of learning complexity Theorem 6.
• Section B: Technical lemmas useful for the proofs.
• Section C: Discussion and consistency results on a family of surrogate losses.
• Section D: Bounds on the calibration functions.
  • Section D.1: Theorem 7 – a lower bound.
  • Section D.2: Theorem 8 – an upper bound.
  • Section D.3: Computation of the bounds for specific task losses.
• Section E: Computations of the exact calibration functions for the quadratic surrogate.
  • Section E.1: 0-1 loss.
  • Section E.2: Block 0-1 loss.
  • Section E.3: Hamming loss.
  • Section E.4: Mixed 0-1 and block 0-1 loss.
• Section F: Computing constants appearing in the SGD rate.
• Section G: Properties of the basis of the Hamming loss.

## Appendix A Learning complexity theorem

###### Theorem 6 (Learning complexity).

Under the assumptions of Theorem 5, for any , the random (w.r.t. the observed training set) output  of the ASGD algorithm after

$$
N > N_* := \frac{4 D^2 M^2}{\check{H}^2_{\Phi,L,\mathcal{F}}(\varepsilon)} \tag{19}
$$

iterations has the expected excess risk bounded with , i.e.,

###### Proof.

By (10) from Theorem 5, steps of the algorithm, in expectation, result in  accuracy on the surrogate risk, i.e., . We now generalize the proof of Theorem 2 to the case of expectation w.r.t.  depending on the random samples used by the ASGD algorithm. We take the expectation of (4) w.r.t.  substituted as  and use Jensen’s inequality (by convexity of ) to get . Finally, monotonicity of  implies . ∎
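To get a feeling for the threshold (19), the sketch below evaluates $4D^2M^2 / \check{H}^2(\varepsilon)$ for a few numbers of labels $k$. All numerical choices are hypothetical and made by us for illustration only: the constants are set to $D = M = 1$ (in general they may themselves depend on $k$), and the calibration function is taken to be $\varepsilon^2/(4k)$, which is convex in $\varepsilon$.

```python
def iterations_needed(D, M, calibration_value):
    """Threshold 4 D^2 M^2 / H^2 on the number of iterations, as read from (19)."""
    if calibration_value <= 0:
        raise ValueError("the calibration function must be positive at the target accuracy")
    return 4.0 * D ** 2 * M ** 2 / calibration_value ** 2


def quadratic_calibration(eps, k):
    """Hypothetical calibration function eps^2 / (4k), used only for illustration."""
    return eps ** 2 / (4.0 * k)


eps = 0.1
for k in (2, 2 ** 5, 2 ** 10):
    n_star = iterations_needed(D=1.0, M=1.0, calibration_value=quadratic_calibration(eps, k))
    print(k, n_star)
```

Under these assumptions the threshold grows as $k^2 / \varepsilon^4$, which illustrates why a calibration function that decays with the number of labels makes the guarantee vacuous when $k$ is exponential.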

## Appendix B Technical lemmas

In this section, we prove two technical lemmas that simplify the proofs of the main theoretical claims of the paper.

Lemma 9 computes the excess of the weighted surrogate risk  for the quadratic loss  (12), which is central to our analysis presented in Section 4. The key property of this result is that the excess  is jointly convex w.r.t. the parameters  and conditional distribution , which simplifies further analysis.

Lemma 10 allows us to cope with the combinatorial aspect of the computation of the calibration function. In particular, when the excess of the weighted surrogate risk is convex, Lemma 10 reduces the computation of the calibration function to a set of convex optimization problems, which can often be solved analytically. For symmetric losses, such as the 0-1, block 0-1 and Hamming losses, Lemma 10 also provides “symmetry breaking”, meaning that many of the obtained convex optimization problems are identical up to a permutation of labels.

###### Lemma 9.

Consider the quadratic surrogate  (12) defined for a task loss . Let a subspace of scores  be parametrized by , i.e., with , and assume that . Then, the excess of the weighted surrogate loss can be expressed as

###### Proof.

By using the definition of the quadratic surrogate  (12), we have

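$$
\begin{aligned}
\phi(f(\theta), q) &= \frac{1}{2k}\left(\theta^T F^T F \theta + 2\theta^T F^T L q\right) + r(q), \\
\theta^* &:= -(F^T F)^\dagger F^T L q, \\
\delta\phi(f(\theta), q) &= \frac{1}{2k}\left(\theta^T F^T F \theta + 2\theta^T F^T L q + q^T L^T F (F^T F)^\dagger F^T L q\right),
\end{aligned}
$$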

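where $r(q)$ denotes the quantity independent of the parameters $\theta$. Note that $F(F^T F)^\dagger F^T$ is the orthogonal projection on the subspace $\text{span}(F)$, so if the columns of $L$ belong to $\text{span}(F)$, we have $\delta\phi(f(\theta), q) = \frac{1}{2k}\|F\theta + Lq\|_2^2$, which finishes the proof. ∎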
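The identity can be sanity-checked numerically. The sketch below is ours and rests on two assumptions: the quadratic surrogate has the form $\frac{1}{2k}\|f + L(:,y)\|_2^2$ (which is consistent with the expansion in the proof), and we take $F = L$ so that the columns of $L$ trivially belong to $\text{span}(F)$. In that case $\theta^* = -q$ gives $F\theta^* = -Lq$, and the excess over this minimizer should coincide with $\frac{1}{2k}\|F\theta + Lq\|_2^2$.

```python
def mat_vec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def sq_norm(x):
    return sum(v * v for v in x)


def expected_quadratic_surrogate(F, L, theta, q):
    """E_{y ~ q} of (1/2k) * ||F theta + L[:, y]||^2 (assumed form of the quadratic surrogate)."""
    k = len(L)
    f = mat_vec(F, theta)
    total = 0.0
    for y, q_y in enumerate(q):
        column = [L[c][y] for c in range(k)]
        total += q_y * sq_norm([f_c + l_c for f_c, l_c in zip(f, column)]) / (2 * k)
    return total


def excess_closed_form(F, L, theta, q):
    """(1/2k) * ||F theta + L q||^2."""
    k = len(L)
    f, Lq = mat_vec(F, theta), mat_vec(L, q)
    return sq_norm([a + b for a, b in zip(f, Lq)]) / (2 * k)


# 0-1 loss on k = 3 labels; F = L is an illustrative choice.
L = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
F = L
q = [0.5, 0.3, 0.2]
theta = [0.4, -1.0, 0.7]
theta_star = [-v for v in q]  # F theta_star = -L q, so the closed form vanishes there

excess = (expected_quadratic_surrogate(F, L, theta, q)
          - expected_quadratic_surrogate(F, L, theta_star, q))
print(excess, excess_closed_form(F, L, theta, q))
```

The two printed numbers agree up to round-off, and the closed form is a squared norm of an expression that is linear in $(\theta, q)$, which is what makes the excess jointly convex.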

###### Lemma 10.

In the case of a finite number $k$ of labels, for any task loss $L$, a surrogate loss $\Phi$ that is continuous and bounded from below, and a set of scores $\mathcal{F}$, the calibration function can be written as

$$
H_{\Phi,L,\mathcal{F}}(\varepsilon) = \min_{\substack{i,j \in \text{pred}(\mathcal{F}) \\ i \neq j}} H_{ij}(\varepsilon), \tag{20}
$$

where the set $\text{pred}(\mathcal{F})$ is defined as the set of labels that the predictor can predict for some feasible scores and $H_{ij}$ is defined via minimization of the same objective as (5), but w.r.t. a smaller domain:

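$$
\begin{aligned}
H_{ij}(\varepsilon) = \inf_{f, q} \;& \delta\phi(f, q), \\
\text{s.t.} \;& \ell_i(q) \le \ell_j(q) - \varepsilon, \\
& \ell_i(q) \le \ell_c(q), \; \forall c \in \text{pred}(\mathcal{F}), \\
& f_j \ge f_c, \; \forall c \in \text{pred}(\mathcal{F}), \\
& f \in \mathcal{F}, \; q \in \Delta_k.
\end{aligned} \tag{21}
$$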

Here $\ell_c(q)$ is the expected loss if predicting label $c$. Index $i$ represents a label with the smallest expected loss while index $j$ represents a label with the largest score.

###### Proof.

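We use the notation $\mathcal{F}_j$ to define the set of score vectors $f$ where the predictor takes a value $j$, i.e., $\mathcal{F}_j := \{f \in \mathcal{F} \mid \text{pred}(f) = j\}$. The union of the sets $\mathcal{F}_j$, $j \in \text{pred}(\mathcal{F})$, equals the whole set $\mathcal{F}$. It is possible that the sets $\mathcal{F}_j$ do not fully contain their boundary because of the usage of a particular tie-breaking strategy, but their closure can be expressed as $\overline{\mathcal{F}}_j := \{f \in \mathcal{F} \mid f_j \ge f_c, \forall c \in \text{pred}(\mathcal{F})\}$.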

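If $f \in \mathcal{F}_j$, i.e., $j = \text{pred}(f)$, then the feasible set of probability vectors $q$ for which a label $i$ is one of the best possible predictions (i.e., $\delta\ell(f, q) = \ell_j(q) - \ell_i(q) \ge \varepsilon$) is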

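$$
\Delta_{k,i,j,\varepsilon} := \{q \in \Delta_k \mid \ell_i(q) \le \ell_c(q), \; \forall c \in \text{pred}(\mathcal{F}); \; \ell_j(q) - \ell_i(q) \ge \varepsilon\},
$$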

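because $\inf_{f' \in \mathcal{F}} \ell_{\text{pred}(f')}(q) = \min_{c \in \text{pred}(\mathcal{F})} \ell_c(q)$.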

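The union of the sets $\{\mathcal{F}_j \times \Delta_{k,i,j,\varepsilon}\}_{i,j \in \text{pred}(\mathcal{F})}$ thus exactly equals the feasibility set of the optimization problem (5)-(6) (note that this is not true for the union of the sets $\{\overline{\mathcal{F}}_j \times \Delta_{k,i,j,\varepsilon}\}_{i,j \in \text{pred}(\mathcal{F})}$, which can be strictly larger). We can therefore rewrite the definition of the calibration function as follows: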

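$$
H_{\Phi,L,\mathcal{F}}(\varepsilon) = \min_{\substack{i,j \in \text{pred}(\mathcal{F}) \\ i \neq j}} \; \inf_{f \in \mathcal{F}_j, \; q \in \Delta_{k,i,j,\varepsilon}} \delta\phi(f, q). \tag{22}
$$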

To finish the proof, we use Lemma 27 of Zhang (2004a), which states that the function $\delta\phi(f, q)$ is continuous w.r.t. both $q$ and $f$. This allows us to substitute the sets $\mathcal{F}_j$ in (22) with their closures $\overline{\mathcal{F}}_j$ without changing the value of the infimum. ∎
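The decomposition (20)-(21) can be illustrated by brute force on the smallest possible case. The sketch below is ours and makes several assumptions: the task loss is the 0-1 loss with $k = 2$ labels, the scores are unconstrained, the predictor picks the label with the largest score, and the excess surrogate risk has the form $\frac{1}{2k}\|f + Lq\|_2^2$. Each $H_{ij}$ is approximated by a grid search, and the result is compared with $\varepsilon^2/8$, the value we obtain by projecting the point $-Lq$ onto the half-space $f_j \ge f_i$.

```python
def excess_surrogate(f, q):
    """(1/2k) * ||f + L q||^2 for the 0-1 loss with k = 2, where L q = (q[1], q[0])."""
    return ((f[0] + q[1]) ** 2 + (f[1] + q[0]) ** 2) / 4.0


def H_ij(i, j, eps, steps=20):
    """Grid approximation of (21): label i has the smallest expected loss, label j the largest score."""
    best = float("inf")
    grid = [n / steps for n in range(steps + 1)]
    for q_i in grid:
        q = [0.0, 0.0]
        q[i], q[j] = q_i, 1.0 - q_i
        loss = [q[1], q[0]]  # expected 0-1 loss of predicting label 0 or label 1
        if loss[i] > loss[j] - eps + 1e-12:
            continue  # violates l_i(q) <= l_j(q) - eps
        for a in grid:
            for b in grid:
                f = [-a, -b]  # scores are searched over [-1, 0]^2
                if f[j] < f[i]:
                    continue  # violates f_j >= f_c
                best = min(best, excess_surrogate(f, q))
    return best


def calibration_function(eps):
    """(20): minimum of H_ij over ordered pairs of distinct labels."""
    return min(H_ij(i, j, eps) for i in (0, 1) for j in (0, 1) if i != j)


eps = 0.5
print(calibration_function(eps), eps ** 2 / 8)
```

By symmetry of the 0-1 loss the two problems $H_{01}$ and $H_{10}$ have the same value, a small instance of the “symmetry breaking” mentioned above. The grid search is only a sanity check and does not scale beyond toy problems.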

## Appendix C Consistent surrogate losses

An ideal surrogate should not only be consistent, but also allow efficient optimization, by, e.g., being convex and allowing fast computation of stochastic gradients. In this paper, we study a generalization to arbitrary multi-class losses of a surrogate loss class from Zhang (2004a, Section 4.4.2) that satisfies these requirements (Zhang (2004a) refers to this surrogate as the “decoupled unconstrained background discriminative surrogate”; note the $\frac{1}{k}$ scaling, which makes $\Phi_{a,b}$ of order $1$):

$$
\Phi_{a,b}(f, y) := \frac{1}{k} \sum_{c=1}^{k} \bigl(L(c, y)\, a(f_c) + b(f_c)\bigr), \tag{23}
$$

where $a, b: \mathbb{R} \to \mathbb{R}$ are convex functions. A generic method to minimize this surrogate is to use any version of the SGD algorithm, while computing the stochastic gradient by sampling $(x, y)$ from the data generating distribution and a label $c$ uniformly. In the case of the quadratic surrogate (12), we proposed in the main paper to compute the sum over $c$ analytically instead of sampling $c$.
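The label-sampling scheme can be sketched as follows. This is our illustration and not the implementation used in the paper: the function names are hypothetical, gradients are taken w.r.t. the score vector $f$ only, and the demo uses $a(f) = f$, $b(f) = f^2/2$, which we assume matches the quadratic surrogate up to a term independent of $f$. The sampled estimate puts the full derivative of the $c$-th summand on coordinate $c$, and averaging it over a uniformly drawn $c$ gives back the gradient of (23).

```python
def full_gradient(f, y, L, a_prime, b_prime):
    """Gradient of (23) w.r.t. the score vector f for a fixed label y."""
    k = len(f)
    return [(L[c][y] * a_prime(f[c]) + b_prime(f[c])) / k for c in range(k)]


def sampled_gradient(f, y, c, L, a_prime, b_prime):
    """Estimate built from one sampled label c; only coordinate c is nonzero."""
    grad = [0.0] * len(f)
    # The 1/k factor of (23) cancels with the weight 1 / P(c) = k of uniform sampling.
    grad[c] = L[c][y] * a_prime(f[c]) + b_prime(f[c])
    return grad


def average_over_labels(f, y, L, a_prime, b_prime):
    """Exact expectation of the sampled estimate when c is uniform over the k labels."""
    k = len(f)
    total = [0.0] * k
    for c in range(k):
        g = sampled_gradient(f, y, c, L, a_prime, b_prime)
        total = [t + g_c / k for t, g_c in zip(total, g)]
    return total


def a_prime(t):
    return 1.0  # derivative of a(f) = f


def b_prime(t):
    return t  # derivative of b(f) = f^2 / 2


L01 = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]  # 0-1 loss, k = 3
f = [0.3, -0.2, 0.5]
print(full_gradient(f, 1, L01, a_prime, b_prime))
print(average_over_labels(f, 1, L01, a_prime, b_prime))
```

Each sampled gradient touches a single coordinate, so its cost does not depend on $k$, at the price of a higher variance than the analytic sum.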

Extending the argument from Zhang (2004a), we show that the surrogates of the form (23) are consistent w.r.t. a task loss $L$ under some sufficient assumptions formalized in Theorem 11.

###### Theorem 11 (Sufficient conditions for consistency).

The surrogate loss $\Phi_{a,b}$ is consistent w.r.t. a task loss $L$, i.e., $H_{\Phi_{a,b},L,\mathbb{R}^k}(\varepsilon) > 0$ for any $\varepsilon > 0$, under the following conditions on the functions $a$ and $b$:

1. The functions $a$ and $b$ are convex and differentiable.

2. The function is bounded from below and has a unique global minimizer (finite or infinite) for all .

3. The functions  and  are strictly increasing.

###### Proof.

Consider an arbitrary conditional probability vector . Assumption 2 then implies that the global minimizer  of the conditional surrogate risk  w.r.t.  is unique. Assumption 1 allows us to set the derivatives to zero and obtain  where . Assumption 3 then implies that  holds if and only if .
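The first step of this argument can be illustrated numerically. The sketch below is ours: it solves the coordinate-wise stationarity condition $\ell_c(q)\, a'(f_c) + b'(f_c) = 0$ of the conditional risk of (23) by bisection and checks that the label with the largest stationary score is the label with the smallest expected loss. The two pairs $(a, b)$, namely $a(f) = f$, $b(f) = f^2/2$ and $a(f) = e^f$, $b(f) = e^{-f}$, as well as the loss matrix, are hypothetical choices; we only rely on their convexity and differentiability and do not claim that they cover every condition of the theorem.

```python
import math


def expected_losses(L, q):
    """l_c(q) = sum_y L[c][y] * q[y], the expected loss of predicting label c."""
    return [sum(L_cy * q_y for L_cy, q_y in zip(row, q)) for row in L]


def stationary_score(loss_c, a_prime, b_prime, lo=-20.0, hi=20.0):
    """Root of loss_c * a'(f) + b'(f) found by bisection.

    The expression is nondecreasing in f whenever a and b are convex and loss_c >= 0.
    """
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if loss_c * a_prime(mid) + b_prime(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0


def argmax(values):
    return max(range(len(values)), key=lambda c: values[c])


def argmin(values):
    return min(range(len(values)), key=lambda c: values[c])


L_abs = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]  # absolute-difference loss, k = 3
q = [0.2, 0.5, 0.3]
losses = expected_losses(L_abs, q)

choices = {
    "quadratic": (lambda f: 1.0, lambda f: f),
    "exponential": (lambda f: math.exp(f), lambda f: -math.exp(-f)),
}
for name, (a_prime, b_prime) in choices.items():
    scores = [stationary_score(l, a_prime, b_prime) for l in losses]
    print(name, scores, argmax(scores) == argmin(losses))
```

For the quadratic choice the stationary scores are $-\ell_c(q)$, and for the exponential choice they are $-\frac{1}{2}\log \ell_c(q)$; in both cases the ordering of the scores reverses the ordering of the expected losses.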

Now, we will prove by contradiction that for any . Assume that for some  we have . Lemma 10 then implies that for some , , we have . Note that the domain of (21) defining is separable w.r.t.  and . We can now rewrite (21) as

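$$
H_{ij}(\varepsilon) = \inf_{q \in \Delta_{k,i,j,\varepsilon}} \delta\phi^*(q), \quad \text{where} \quad \delta\phi^*(q) := \inf_{f \in \overline{\mathcal{F}}_j} \delta\phi(f, q),
$$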

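where $\Delta_{k,i,j,\varepsilon}$ and $\overline{\mathcal{F}}_j$ are defined in the proof of Lemma 10. Lemma 27 of Zhang (2004a) implies that the function $\delta\phi^*(q)$ is a continuous function of $q$. Given that $\Delta_{k,i,j,\varepsilon}$ is a compact set, the infimum is achieved at some point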