On Structured Prediction Theory with Calibrated Convex Surrogate Losses
We provide novel theoretical insights on structured prediction in the context of efficient convex surrogate loss minimization with consistency guarantees. For any task loss, we construct a convex surrogate that can be optimized via stochastic gradient descent and we prove tight bounds on the so-called “calibration function” relating the excess surrogate risk to the actual risk. In contrast to prior related work, we carefully monitor the effect of the exponential number of classes in the learning guarantees as well as on the optimization complexity. As an interesting consequence, we formalize the intuition that some task losses make learning harder than others, and that the classical 0-1 loss is ill-suited for structured prediction.
Anton Osokin (INRIA/ENS; HSE), Francis Bach (INRIA/ENS, Paris, France), Simon Lacoste-Julien (MILA and DIRO, Université de Montréal, Canada)
INRIA/ENS: DI École normale supérieure, CNRS, PSL Research University, Paris, France. HSE: National Research University Higher School of Economics, Moscow, Russia.
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
1 Introduction
Structured prediction is a subfield of machine learning aiming at making multiple interrelated predictions simultaneously. The desired outputs (labels) are typically organized in some structured object such as a sequence, a graph, an image, etc. Tasks of this type appear in many practical domains such as computer vision (Nowozin & Lampert, 2011), natural language processing (Smith, 2011) and bioinformatics (Durbin et al., 1998).
The structured prediction setup has at least two typical properties differentiating it from the classical binary classification problems extensively studied in learning theory:
Exponential number of classes: this brings both additional computational and statistical challenges. By exponential, we mean exponentially large in the size of the natural dimension of output, e.g., the number of all possible sequences is exponential w.r.t. the sequence length.
Cost-sensitive learning: in typical applications, prediction mistakes are not all equally costly. The prediction error is usually measured with a highly-structured task-specific loss function, e.g., Hamming distance between sequences of multi-label variables or mean average precision for ranking.
Despite many algorithmic advances to tackle structured prediction problems (Bakir et al., 2007; Nowozin et al., 2014), there have been relatively few papers devoted to its theoretical understanding. Notable recent exceptions that made significant progress include Cortes et al. (2016) and London et al. (2016) (see references therein), which proposed data-dependent generalization error bounds in terms of popular empirical convex surrogate losses such as the structured hinge loss (Taskar et al., 2003, 2005; Tsochantaridis et al., 2005). A question not addressed by these works is whether their algorithms are consistent: does minimizing their convex bounds with infinite data lead to the minimization of the task loss as well? Alternatively, the structured probit and ramp losses are consistent (McAllester, 2007; McAllester & Keshet, 2011), but non-convex, and thus it is hard to obtain computational guarantees for them. In this paper, we aim at getting the property of consistency for surrogate losses that can be efficiently minimized with guarantees, and thus we consider convex surrogate losses.
The consistency of convex surrogates is well understood in the case of binary classification (Zhang, 2004b; Bartlett et al., 2006; Steinwart, 2007) and there is significant progress in the case of multi-class 0-1 loss (Zhang, 2004a; Tewari & Bartlett, 2007) and general multi-class loss functions (Ávila Pires et al., 2013; Ramaswamy & Agarwal, 2016; Williamson et al., 2016). A large body of work specifically focuses on the related tasks of ranking (Duchi et al., 2010; Calauzènes et al., 2012; Ramaswamy et al., 2013) and ordinal regression (Pedregosa et al., 2017).
Contributions. In this paper, we study consistent convex surrogate losses specifically in the context of an exponential number of classes. We argue that even while being consistent, a convex surrogate might not allow efficient learning. As a concrete example, Ciliberto et al. (2016) recently proposed a consistent approach to structured prediction, but the constant in their generalization error bound can be exponentially large, as we explain in Section 5. There are two possible sources of difficulties from the optimization perspective: to reach adequate accuracy on the task loss, one might need to optimize a surrogate loss to exponentially small accuracy; or to reach adequate accuracy on the surrogate loss, one might need an exponential number of algorithm steps because of exponentially large constants in the convergence rate. We propose a theoretical framework that jointly tackles these two aspects and allows one to judge the feasibility of efficient learning. In particular, we construct a calibration function (Steinwart, 2007), i.e., a function setting the relationship between accuracy on the surrogate and task losses, and normalize it by means of the convergence rate of an optimization algorithm.
Aiming for the simplest possible application of our framework, we propose a family of convex surrogates that are consistent for any given task loss and can be optimized using stochastic gradient descent. For a special case of our family (quadratic surrogate), we provide a complete analysis including general lower and upper bounds on the calibration function for any task loss, with exact values for the 0-1, block 0-1 and Hamming losses. We observe that to have a tractable learning algorithm, one needs both a structured loss (not the 0-1 loss) and appropriate constraints on the predictor, e.g., in the form of linear constraints for the score vector functions. Our framework also indicates that in some cases it might be beneficial to use non-consistent surrogates. In particular, a non-consistent surrogate might allow optimization only up to specific accuracy, but exponentially faster than a consistent one.
2 Structured prediction setup
In structured prediction, the goal is to predict a structured output $\hat{y} \in \mathcal{Y}$ (such as a sequence, a graph, an image) given an input $x \in \mathcal{X}$. The quality of prediction is measured by a task-dependent loss function $L(\hat{y}, y \mid x) \geq 0$ specifying the cost for predicting $\hat{y}$ when the correct output is $y$. In this paper, we consider the case when the number of possible predictions and the number of possible labels are both finite. For simplicity, we also assume that the sets of possible predictions and correct outputs always coincide and do not depend on $x$. (Our analysis is generalizable to rectangular losses, e.g., the ranking losses studied by Ramaswamy et al. (2013).) We refer to this set as the set of labels $\mathcal{Y}$, denote its cardinality by $k$, and map its elements to $1, \dots, k$. In this setting, assuming that the loss function depends only on $\hat{y}$ and $y$, but not on $x$ directly, the loss is defined by a loss matrix $L \in \mathbb{R}^{k \times k}$. We assume that all the elements of the matrix $L$ are non-negative and will use $L_{\max}$ to denote the maximal element. Compared to multi-class classification, $k$ is typically exponentially large in the size of the natural dimension of $y$, e.g., $\mathcal{Y}$ contains all possible sequences of symbols from a finite alphabet.
Following standard practices in structured prediction (Collins, 2002; Taskar et al., 2003), we define the prediction model by a score function $f : \mathcal{X} \to \mathbb{R}^k$ specifying a score $f_y(x)$ for each possible output $y \in \mathcal{Y}$. The final prediction is done by selecting a label with the maximal value of the score
$$\mathrm{pred}(f(x)) := \operatorname*{argmax}_{\hat{y} \in \mathcal{Y}} f_{\hat{y}}(x), \tag{1}$$
with some fixed strategy to resolve ties. To simplify the analysis, we assume that among the labels with maximal scores, the predictor always picks the one with the smallest index.
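The prediction rule with its tie-breaking convention can be sketched in a few lines. This is only an illustration: the function name `predict` is ours, and labels are indexed from 0 rather than from 1.

```python
def predict(scores):
    """Index of the largest score; ties are resolved toward the smallest index."""
    best = 0
    for c in range(1, len(scores)):
        # The strict inequality keeps the earliest index among equal maxima.
        if scores[c] > scores[best]:
            best = c
    return best


if __name__ == "__main__":
    print(predict([0.1, 0.7, 0.7]))  # labels 1 and 2 tie; label 1 is returned
```

A fixed tie-breaking rule makes the prediction a deterministic function of the score vector, which the pointwise analysis relies on.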
The goal of prediction-based machine learning consists in finding a predictor $f$ that works well on the unseen test set, i.e., data points coming from the same distribution $\mathcal{D}$ as the one generating the training data. One way to formalize this is to minimize the generalization error, often referred to as the actual (or population) risk based on the loss $L$,
$$R_L(f) := \mathbb{E}_{(x,y) \sim \mathcal{D}}\, L\big(\mathrm{pred}(f(x)), y\big). \tag{2}$$
Minimizing the actual risk (2) is usually hard. The standard approach is to minimize a surrogate risk, which is a different objective easier to optimize, e.g., convex. We define a surrogate loss as a function $\Phi : \mathbb{R}^k \times \mathcal{Y} \to \mathbb{R}$ depending on a score vector $f = f(x) \in \mathbb{R}^k$ and a target label $y \in \mathcal{Y}$ as input arguments. We denote the $y$-th component of $f$ with $f_y$. The surrogate risk (the $\Phi$-risk) is defined as
$$R_\Phi(f) := \mathbb{E}_{(x,y) \sim \mathcal{D}}\, \Phi(f(x), y), \tag{3}$$
where the expectation is taken w.r.t. the data-generating distribution $\mathcal{D}$. To make the minimization of (3) well-defined, we always assume that the surrogate loss $\Phi$ is bounded from below and continuous.
Examples of common surrogate losses include the structured hinge loss (Taskar et al., 2003; Tsochantaridis et al., 2005), the log loss (maximum likelihood learning) used, e.g., in conditional random fields (Lafferty et al., 2001), and their hybrids (Pletscher et al., 2010; Gimpel & Smith, 2010; Hazan & Urtasun, 2010; Shi et al., 2015).
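In terms of task losses, we consider the unstructured 0-1 loss $L_{01}(\hat{y}, y) := [\hat{y} \neq y]$ (here we use the Iverson bracket notation, i.e., $[A] := 1$ if a logical expression $A$ is true, and zero otherwise), and the two following structured losses: the block 0-1 loss with $b$ equal blocks of labels, $L_{01,b}(\hat{y}, y) := [\hat{y} \text{ and } y \text{ are not in the same block}]$; and the (normalized) Hamming loss between tuples of $T$ binary variables $y_t$: $L_{\mathrm{Ham},T}(\hat{y}, y) := \frac{1}{T} \sum_{t=1}^{T} [\hat{y}_t \neq y_t]$. To illustrate some aspects of our analysis, we also look at the mixed loss $L_{01,b,\eta}$: a convex combination of the 0-1 and block 0-1 losses, defined as $L_{01,b,\eta} := \eta L_{01} + (1 - \eta) L_{01,b}$ for some $\eta \in [0, 1]$.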
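For small label sets, the four task losses can be written out as explicit loss matrices. The sketch below is illustrative only: the function names are ours, labels are indexed from 0, blocks are assumed to be contiguous index ranges, and Hamming labels are enumerated as all binary tuples of length `T` in lexicographic order.

```python
from itertools import product


def zero_one_loss(k):
    """k x k matrix of the 0-1 loss: 1 off the diagonal, 0 on it."""
    return [[0.0 if i == j else 1.0 for j in range(k)] for i in range(k)]


def block_zero_one_loss(k, b):
    """Block 0-1 loss with b equal blocks of contiguous labels (assumed layout)."""
    if k % b != 0:
        raise ValueError("k must be divisible by b")
    size = k // b
    return [[0.0 if i // size == j // size else 1.0 for j in range(k)]
            for i in range(k)]


def hamming_loss(T):
    """Normalized Hamming loss between all 2**T binary tuples of length T."""
    labels = list(product([0, 1], repeat=T))
    return [[sum(a != c for a, c in zip(y_hat, y)) / T for y in labels]
            for y_hat in labels]


def mixed_loss(k, b, eta):
    """Convex combination eta * (0-1 loss) + (1 - eta) * (block 0-1 loss)."""
    l01, lb = zero_one_loss(k), block_zero_one_loss(k, b)
    return [[eta * l01[i][j] + (1.0 - eta) * lb[i][j] for j in range(k)]
            for i in range(k)]
```

Note that the matrices have $k^2$ entries, so explicit construction is only possible for toy problems; for realistic structured outputs $k$ is exponential and the loss must be handled implicitly.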
3 Consistency for structured prediction
3.1 Calibration function
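We now formalize the connection between the actual risk $R_L$ and the surrogate $\Phi$-risk $R_\Phi$ via the so-called calibration function, see Definition 1 below (Bartlett et al., 2006; Zhang, 2004a; Steinwart, 2007; Duchi et al., 2010; Ávila Pires et al., 2013). As is standard for this kind of analysis, the setup is non-parametric, i.e., it does not take into account the dependency of scores on input variables $x$. For now, we assume that a family of score functions $\mathcal{F}_{\mathcal{F}}$ consists of all vector-valued Borel measurable functions $f : \mathcal{X} \to \mathcal{F}$, where $\mathcal{F} \subseteq \mathbb{R}^k$ is a subspace of allowed score vectors, which will play an important role in our analysis. This setting is equivalent to a pointwise analysis, i.e., looking at the different inputs $x$ independently. We bring the dependency on the input back into the analysis in Section 3.3, where we assume a specific family of score functions.
Let $\mathcal{D}_{\mathcal{X}}$ represent the marginal distribution for $\mathcal{D}$ on $x$ and $\mathbb{P}(\cdot \mid x)$ denote its conditional given $x$. We can now rewrite the risk $R_L$ and $\Phi$-risk $R_\Phi$ as
$$R_L(f) = \mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}}\, \ell\big(f(x), \mathbb{P}(\cdot \mid x)\big), \qquad R_\Phi(f) = \mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}}\, \phi\big(f(x), \mathbb{P}(\cdot \mid x)\big),$$
where the conditional risk $\ell$ and the conditional $\Phi$-risk $\phi$ depend on a vector of scores $f$ and a conditional distribution $q$ on the set of output labels as
$$\ell(f, q) := \sum_{c=1}^{k} q_c\, L(\mathrm{pred}(f), c), \qquad \phi(f, q) := \sum_{c=1}^{k} q_c\, \Phi(f, c).$$
The calibration function $H_{\Phi, L, \mathcal{F}}$ between the surrogate loss $\Phi$ and the task loss $L$ relates the excess surrogate risk with the actual excess risk via the excess risk bound:
$$H_{\Phi, L, \mathcal{F}}\big(\delta\ell(f, q)\big) \leq \delta\phi(f, q), \quad \forall f \in \mathcal{F},\ \forall q \in \Delta_k,$$
where $\delta\phi(f, q) = \phi(f, q) - \inf_{\hat{f} \in \mathcal{F}} \phi(\hat{f}, q)$ and $\delta\ell(f, q) = \ell(f, q) - \inf_{\hat{f} \in \mathcal{F}} \ell(\hat{f}, q)$ are the excess risks and $\Delta_k$ denotes the probability simplex on $k$ elements.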
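In other words, to find a vector $f$ that yields an excess risk smaller than $\varepsilon$, we need to optimize the $\Phi$-risk up to $H_{\Phi, L, \mathcal{F}}(\varepsilon)$ accuracy (in the worst case). We make this statement precise in Theorem 2 below, and now proceed to the formal definition of the calibration function.
Definition 1 (Calibration function).
For a task loss $L$, a surrogate loss $\Phi$, a set of feasible scores $\mathcal{F}$, the calibration function $H_{\Phi, L, \mathcal{F}}(\varepsilon)$ (defined for $\varepsilon \geq 0$) equals the infimum excess of the conditional surrogate risk when the excess of the conditional actual risk is at least $\varepsilon$:
$$H_{\Phi, L, \mathcal{F}}(\varepsilon) := \inf_{f \in \mathcal{F},\ q \in \Delta_k} \delta\phi(f, q) \quad \text{s.t.} \quad \delta\ell(f, q) \geq \varepsilon.$$
We set $H_{\Phi, L, \mathcal{F}}(\varepsilon)$ to $+\infty$ when the feasible set is empty.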
Figure 1: Examples of calibration functions for consistent and inconsistent surrogate losses. (a) Hamming loss. (b) Mixed loss.
3.2 Notion of consistency
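Theorem 2 (Calibration connection).
Let $H_{\Phi, L, \mathcal{F}}$ be the calibration function between the surrogate loss $\Phi$ and the task loss $L$ with feasible set of scores $\mathcal{F} \subseteq \mathbb{R}^k$. Let $\check{H}_{\Phi, L, \mathcal{F}}$ be a convex non-decreasing lower bound of the calibration function. Assume that $\Phi$ is continuous and bounded from below. Then, for any $\varepsilon > 0$ with finite $\check{H}_{\Phi, L, \mathcal{F}}(\varepsilon)$ and any $f \in \mathcal{F}_{\mathcal{F}}$, we have
$$R_\Phi(f) < R^*_{\Phi, \mathcal{F}} + \check{H}_{\Phi, L, \mathcal{F}}(\varepsilon) \ \Rightarrow\ R_L(f) < R^*_{L, \mathcal{F}} + \varepsilon, \tag{7}$$
where $R^*_{\Phi, \mathcal{F}} := \inf_{f \in \mathcal{F}_{\mathcal{F}}} R_\Phi(f)$ and $R^*_{L, \mathcal{F}} := \inf_{f \in \mathcal{F}_{\mathcal{F}}} R_L(f)$.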
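A suitable convex non-decreasing lower bound $\check{H}_{\Phi, L, \mathcal{F}}(\varepsilon)$ required by Theorem 2 always exists, e.g., the zero constant. However, in this case Theorem 2 is not informative, because the l.h.s. of (7) is never true. Zhang (2004a, Proposition 25) claims that $\check{H}_{\Phi, L, \mathcal{F}}$ defined as the lower convex envelope of the calibration function $H_{\Phi, L, \mathcal{F}}$ satisfies $\check{H}_{\Phi, L, \mathcal{F}}(\varepsilon) > 0$, $\forall \varepsilon > 0$, if $H_{\Phi, L, \mathcal{F}}(\varepsilon) > 0$, $\forall \varepsilon > 0$, and, e.g., the set of labels is finite. This statement implies that an informative $\check{H}_{\Phi, L, \mathcal{F}}$ always exists and allows one to characterize consistency through properties of the calibration function $H_{\Phi, L, \mathcal{F}}$.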
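We now define a notion of level-$\eta$ consistency, which is more general than consistency.
Definition 3 (level-$\eta$ consistency).
A surrogate loss $\Phi$ is consistent up to level $\eta \geq 0$ w.r.t. a task loss $L$ and a set of scores $\mathcal{F}$ if and only if the calibration function satisfies $H_{\Phi, L, \mathcal{F}}(\varepsilon) > 0$ for all $\varepsilon > \eta$ and there exists $\hat{\varepsilon} > \eta$ such that $H_{\Phi, L, \mathcal{F}}(\hat{\varepsilon})$ is finite.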
Looking solely at (standard level-0) consistency vs. inconsistency might be too coarse to capture practical properties related to optimization accuracy (see, e.g., Long & Servedio, 2013). For example, if $H_{\Phi, L, \mathcal{F}}(\varepsilon) = 0$ only for very small values of $\varepsilon$, then the method can still optimize the actual risk up to a certain level which might be good enough in practice, especially if it means that it can be optimized faster. Examples of calibration functions for consistent and inconsistent surrogate losses are shown in Figure 1.
Other notions of consistency. Definition 3 with $\eta = 0$ and $\mathcal{F} = \mathbb{R}^k$ results in the standard setting often appearing in the literature. In particular, in this case Theorem 2 implies Fisher consistency as formulated, e.g., by Pedregosa et al. (2017) for general losses and Lin (2004) for binary classification. This setting is also closely related to many definitions of consistency used in the literature. For example, for a surrogate that is bounded from below and continuous, it is equivalent to infinite-sample consistency (Zhang, 2004a), classification calibration (Tewari & Bartlett, 2007), edge-consistency (Duchi et al., 2010), $L$-calibration (Ramaswamy & Agarwal, 2016), and prediction calibration (Williamson et al., 2016). See Zhang (2004a, Appendix A) for the detailed discussion.
Role of $\mathcal{F}$. Let the approximation error for the restricted set of scores $\mathcal{F}$ be defined as $R^*_{L, \mathcal{F}} - R^*_L := \inf_{f \in \mathcal{F}_{\mathcal{F}}} R_L(f) - \inf_{f \in \mathcal{F}_{\mathbb{R}^k}} R_L(f)$. For any conditional distribution $q$, the score vector $f := -Lq$ will yield an optimal prediction. Thus the condition $\mathrm{span}(L) \subseteq \mathcal{F}$ is sufficient for $\mathcal{F}$ to have zero approximation error for any distribution $\mathcal{D}$, and for our 0-consistency condition to imply the standard Fisher consistency with respect to $L$. In the following, we will see that a restricted $\mathcal{F}$ can play a role both for computational efficiency and for statistical efficiency (thus losses with smaller $\mathrm{span}(L)$ might be easier to work with).
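The claim that the score vector $-Lq$ yields an optimal prediction can be checked numerically on a toy example. The sketch below is an illustration under our own naming; it uses the normalized Hamming loss on two binary variables, with labels ordered 00, 01, 10, 11 and indexed from 0. The conditional risk of predicting $\hat{y}$ is $(Lq)_{\hat{y}}$, so maximizing $-(Lq)_{\hat{y}}$ minimizes it.

```python
def conditional_risk(L, q, y_hat):
    """Expected task loss of predicting y_hat when the label follows q."""
    return sum(q[c] * L[y_hat][c] for c in range(len(q)))


def bayes_scores(L, q):
    """Score vector f = -L q; its argmax minimizes the conditional risk."""
    return [-conditional_risk(L, q, y_hat) for y_hat in range(len(L))]


def predict(scores):
    """Argmax with ties resolved toward the smallest index."""
    return max(range(len(scores)), key=lambda c: (scores[c], -c))


# Normalized Hamming loss on two binary variables (labels 00, 01, 10, 11).
L_HAM = [[0.0, 0.5, 0.5, 1.0],
         [0.5, 0.0, 1.0, 0.5],
         [0.5, 1.0, 0.0, 0.5],
         [1.0, 0.5, 0.5, 0.0]]
Q = [0.1, 0.2, 0.3, 0.4]  # an arbitrary conditional distribution

if __name__ == "__main__":
    y_star = predict(bayes_scores(L_HAM, Q))
    print(y_star, conditional_risk(L_HAM, Q, y_star))
```

Since $-Lq$ always lies in the column space of $L$, any score subspace containing $\mathrm{span}(L)$ can represent this optimal score vector.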
3.3 Connection to optimization accuracy and statistical efficiency
The scale of a calibration function is not intrinsically well-defined: we could multiply the surrogate function by a scalar and it would multiply the calibration function by the same scalar, without changing the optimization problem. Intuitively, we would like the surrogate loss to be of order 1. If with this scale the calibration function is exponentially small (has a $1/k$ factor), then we have strong evidence that the stochastic optimization will be difficult (and thus learning will be slow).
To formalize this intuition, we add to the picture the complexity of optimizing the surrogate loss with a stochastic approximation algorithm. By using a scale-invariant convergence rate, we provide a natural normalization of the calibration function. The following two observations are central to the theoretical insights provided in our work:
Scale. For a properly scaled surrogate loss, the scale of the calibration function is a good indication of whether a stochastic approximation algorithm will take a large number of iterations (in the worst case) to obtain guarantees of small excess of the actual risk (and vice-versa, a large coefficient indicates a small number of iterations). The actual verification requires computing the normalization quantities given in Theorem 6 below.
Statistics. The bound on the number of iterations directly relates to the number of training examples that would be needed to learn, if we see each iteration of the stochastic approximation algorithm as using one training example to optimize the expected surrogate.
To analyze the statistical convergence of surrogate risk optimization, we have to specify the set of score functions $\mathcal{F}_{\mathcal{F}}$ that we work with. We assume that the structure on input $x \in \mathcal{X}$ is defined by a positive definite kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. We denote the corresponding reproducing kernel Hilbert space (RKHS) by $\mathcal{H}$ and its explicit feature map by $\psi(x) \in \mathcal{H}$. By the reproducing property, we have $\langle f, \psi(x) \rangle_{\mathcal{H}} = f(x)$ for all $x \in \mathcal{X}$, $f \in \mathcal{H}$, where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ is the inner product in the RKHS. We define the subspace of allowed scores $\mathcal{F} \subseteq \mathbb{R}^k$ via the span of the columns of a matrix $F \in \mathbb{R}^{k \times r}$. The matrix $F$ explicitly defines the structure of the score function. With this notation, we will assume that the score function is of the form $f(x) = F W \psi(x)$, where $W : \mathcal{H} \to \mathbb{R}^r$ is a linear operator to be learned (a matrix if $\mathcal{H}$ is of finite dimension) that represents a collection of $r$ elements in $\mathcal{H}$, transforming $\psi(x)$ to a vector in $\mathbb{R}^r$ by applying the RKHS inner product $r$ times. (Note that if $\mathrm{rank}(F) = r$, our setup is equivalent to assuming a joint kernel (Tsochantaridis et al., 2005) in the product form: $K_{\mathrm{joint}}((x, c), (x', c')) := K(x, x')\, F(c, :)\, F(c', :)^{\top}$, where $F(c, :)$ is the row $c$ of the matrix $F$.) Note that for structured losses, we usually have $r \ll k$. The set of all score functions is thus obtained by varying $W$ in this definition and is denoted by $\mathcal{F}_{F, \mathcal{H}}$. As a concrete example of a score family for structured prediction, consider the standard sequence model with unary and pairwise potentials. In this case, the dimension $r$ equals $Ts + (T-1)s^2$, where $T$ is the sequence length and $s$ is the number of labels of each variable. The columns of the matrix $F$ consist of $2T - 1$ groups (one for each unary and pairwise potential). Each row of $F$ has exactly one entry equal to one in each column group (with zeros elsewhere).
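The sequence-model example can be made concrete by building the score-structure matrix explicitly for a tiny problem. This is a sketch under our own conventions (the function name, the 0-based label indexing and the column ordering, unary groups first and pairwise groups second, are assumptions): with sequence length `T` and `s` labels per variable, there are `s**T` rows and `T*s + (T-1)*s*s` columns.

```python
from itertools import product


def sequence_score_matrix(T, s):
    """Binary matrix whose rows index label sequences and whose columns index
    unary (T groups of size s) and pairwise (T-1 groups of size s*s) potentials."""
    r = T * s + (T - 1) * s * s
    rows = []
    for y in product(range(s), repeat=T):
        row = [0] * r
        for t in range(T):
            row[t * s + y[t]] = 1  # active unary potential of variable t
        offset = T * s
        for t in range(T - 1):
            # active pairwise potential of the edge (t, t+1)
            row[offset + t * s * s + y[t] * s + y[t + 1]] = 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    F = sequence_score_matrix(3, 2)
    print(len(F), len(F[0]))  # number of labels and number of columns
```

With this matrix, the score of a sequence is the sum of the entries of the parameter vector selected by its row, i.e., the sum of its unary and pairwise potentials.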
In this setting, we use the online projected averaged stochastic subgradient descent ASGD (stochastic w.r.t. data $(x^{(n)}, y^{(n)}) \sim \mathcal{D}$) to minimize the surrogate risk directly (Bousquet & Bottou, 2008); see, e.g., Orabona (2014) for the formal setup of kernel ASGD. The $n$-th update consists in
$$W^{(n)} := P_D\big[ W^{(n-1)} - \gamma^{(n)} F^{\top} \nabla\Phi\, \psi(x^{(n)})^{\top} \big], \tag{9}$$
where $F^{\top} \nabla\Phi\, \psi(x^{(n)})^{\top} : \mathcal{H} \to \mathbb{R}^r$ is the stochastic functional gradient, $\gamma^{(n)}$ is the step size and $P_D$ is the projection on the ball of radius $D$ w.r.t. the Hilbert–Schmidt norm. (The Hilbert–Schmidt norm of a linear operator $A$ is defined as $\|A\|_{HS} = \sqrt{\mathrm{tr}\, A^{\ddagger} A}$, where $A^{\ddagger}$ is the adjoint operator. In the case of finite dimension, the Hilbert–Schmidt norm coincides with the Frobenius matrix norm.) The vector $\nabla\Phi \in \mathbb{R}^k$ is a regular gradient of the sampled surrogate $\Phi(f(x^{(n)}), y^{(n)})$ w.r.t. the scores, $\nabla\Phi = \nabla_f \Phi(f, y^{(n)})\big|_{f = f(x^{(n)})}$. We wrote the above update using an explicit feature map $\psi$ for notational simplicity, but kernel ASGD can also be implemented without it by using the kernel trick. The convergence properties of ASGD in RKHS are analogous to the finite-dimensional ASGD because they rely on dimension-free quantities. To use a simple convergence analysis, we follow Ciliberto et al. (2016) and make the following simplifying assumption:
Assumption 4 (Well-specified optimization w.r.t. the function class $\mathcal{F}_{F, \mathcal{H}}$).
The distribution $\mathcal{D}$ is such that $R_\Phi$ has some global minimum $f^*$ that also belongs to $\mathcal{F}_{F, \mathcal{H}}$.
Assumption 4 simply means that each row of $W^*$ defining $f^*$ belongs to the RKHS $\mathcal{H}$, implying a finite norm $\|W^*\|_{HS}$. Assumption 4 can be relaxed if the kernel $K$ is universal, but then the convergence analysis becomes much more complicated (Orabona, 2014).
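A finite-dimensional version of the projected averaged update can be sketched as follows. Everything here is an illustrative assumption rather than the paper's implementation: the feature map is an explicit vector `psi` of length `d`, the operator `W` is an `r x d` matrix, the surrogate is taken to be a quadratic loss whose gradient w.r.t. the scores is `(f + L[:, y]) / k`, and the step size `gamma` is constant. The function and variable names are ours.

```python
import math


def matvec(A, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def project_ball(W, D):
    """Projection onto the Frobenius-norm ball of radius D."""
    norm = math.sqrt(sum(w * w for row in W for w in row))
    if norm <= D:
        return W
    scale = D / norm
    return [[scale * w for w in row] for row in W]


def asgd_quadratic(samples, F, L, D, gamma):
    """Projected averaged SGD on the scores f(x) = F W psi(x).

    samples: iterable of (psi, y) with psi a feature vector and y a label index.
    Returns the average of the iterates W^(1), ..., W^(N).
    """
    k, r = len(F), len(F[0])
    d = len(samples[0][0])
    W = [[0.0] * d for _ in range(r)]
    W_avg = [[0.0] * d for _ in range(r)]
    for n, (psi, y) in enumerate(samples, start=1):
        f = matvec(F, matvec(W, psi))                       # current scores
        grad_f = [(f[c] + L[c][y]) / k for c in range(k)]   # gradient w.r.t. scores
        g = [sum(F[c][j] * grad_f[c] for c in range(k)) for j in range(r)]  # F^T grad
        step = [[W[j][i] - gamma * g[j] * psi[i] for i in range(d)] for j in range(r)]
        W = project_ball(step, D)
        # running average of the iterates
        W_avg = [[W_avg[j][i] + (W[j][i] - W_avg[j][i]) / n for i in range(d)]
                 for j in range(r)]
    return W_avg
```

The stochastic gradient w.r.t. `W` is the outer product of `F^T grad_f` and `psi`, which mirrors the rank-one structure of the functional gradient in the update above.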
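Theorem 5 (Convergence rate).
Under Assumption 4 and assuming that (i) the functions $\Phi(f, y)$ are bounded from below and convex w.r.t. $f \in \mathbb{R}^k$ for all $y \in \mathcal{Y}$; (ii) the expected square of the norm of the stochastic gradient is bounded, $\mathbb{E}_{(x,y) \sim \mathcal{D}} \|F^{\top} \nabla\Phi\, \psi(x)^{\top}\|_{HS}^2 \leq M^2$; and (iii) $\|W^*\|_{HS} \leq D$, then running the ASGD algorithm (9) with the constant step-size $\gamma := \frac{2D}{M\sqrt{N}}$ for $N$ steps admits the following expected suboptimality for the averaged iterate $\bar{f}^{(N)}$:
$$\mathbb{E}\big[R_\Phi(\bar{f}^{(N)})\big] - R^*_{\Phi, \mathcal{F}} \leq \frac{2DM}{\sqrt{N}}, \quad \text{where } \bar{f}^{(N)} := \frac{1}{N} \sum_{n=1}^{N} F W^{(n)} \psi(\cdot).$$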
By combining the convergence rate of Theorem 5 with Theorem 2 that connects the surrogate and actual risks, we get Theorem 6, which explicitly gives the number of iterations required to achieve $\varepsilon$ accuracy on the expected population risk (see App. A for the proof). Note that since ASGD is applied in an online fashion, Theorem 6 also serves as the sample complexity bound, i.e., says how many samples are needed to achieve $\varepsilon$ target accuracy (compared to the best prediction rule if $\mathcal{F}$ has zero approximation error).
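Theorem 6 (Learning complexity).
Under the assumptions of Theorem 5, for any $\varepsilon > 0$, the random (w.r.t. the observed training set) output $\bar{f}^{(N)} \in \mathcal{F}_{F, \mathcal{H}}$ of the ASGD algorithm after
$$N > N^* := \frac{4 D^2 M^2}{\check{H}^2_{\Phi, L, \mathcal{F}}(\varepsilon)}$$
iterations has the expected excess risk bounded with $\varepsilon$, i.e.,
$$\mathbb{E}\big[R_L(\bar{f}^{(N)})\big] < R^*_{L, \mathcal{F}} + \varepsilon.$$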
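The iteration count in a bound of this type comes from equating the optimization rate with the value of the calibration lower bound. As a hedged sketch, assume the surrogate suboptimality after $N$ steps is at most $2DM/\sqrt{N}$ and that `h_eps` is the value of a convex lower bound on the calibration function at the target accuracy; the function name and arguments are ours.

```python
import math


def iterations_needed(D, M, h_eps):
    """Smallest integer N with 2*D*M/sqrt(N) <= h_eps, i.e. N >= 4 D^2 M^2 / h_eps^2.

    D: bound on the norm of the optimal solution (assumed given).
    M: bound on the root of the expected squared gradient norm (assumed given).
    h_eps: value of the calibration lower bound at the target accuracy.
    """
    if h_eps <= 0:
        raise ValueError("an uninformative (zero) calibration bound gives no guarantee")
    return math.ceil(4.0 * D * D * M * M / (h_eps * h_eps))


if __name__ == "__main__":
    # Halving the calibration value quadruples the required number of iterations.
    print(iterations_needed(1.0, 1.0, 0.5), iterations_needed(1.0, 1.0, 0.25))
```

The quadratic dependence on the inverse calibration value is why an exponentially small calibration function translates into an exponentially large number of samples.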
4 Calibration function analysis for quadratic surrogate
A major challenge to applying Theorem 6 is the computation of the calibration function $H_{\Phi, L, \mathcal{F}}$. In App. C, we present a generalization to arbitrary multi-class losses of a surrogate loss class from Zhang (2004a, Section 4.4.2) that is consistent for any task loss $L$. Here, we consider the simplest example of this family, called the quadratic surrogate $\Phi_{\mathrm{quad}}$, which has the advantage that we can bound or even compute exactly its calibration function. We define the quadratic surrogate as
$$\Phi_{\mathrm{quad}}(f, y) := \frac{1}{2k} \|f + L(:, y)\|_2^2 = \frac{1}{2k} \sum_{c=1}^{k} \big(f_c^2 + 2 f_c L(c, y) + L(c, y)^2\big). \tag{12}$$
One simple sufficient condition for the surrogate (12) to be consistent and also to have zero approximation error is that $\mathcal{F}$ fully contains $\mathrm{span}(L)$. To make the dependence on the score subspace explicit, we parameterize it with a matrix $F \in \mathbb{R}^{k \times r}$ with the number of columns $r$ typically being much smaller than the number of labels $k$. With this notation, we have $\mathcal{F} = \mathrm{span}(F) = \{F\theta \mid \theta \in \mathbb{R}^r\}$, and the dimensionality of $\mathcal{F}$ equals the rank of $F$, which is at most $r$. (Evaluating $\Phi_{\mathrm{quad}}$ requires computing $F^{\top} F$ and $F^{\top} L(:, y)$, for which direct computation is intractable when $k$ is exponential, but which can be done in closed form for the structured losses we consider, i.e., the Hamming and block 0-1 losses. More generally, these operations require suitable inference algorithms. See also App. F.)
For the quadratic surrogate (12), the excess of the expected surrogate takes a simple form:
$$\delta\phi_{\mathrm{quad}}(F\theta, q) = \frac{1}{2k} \|F\theta + Lq\|_2^2. \tag{13}$$
Equation (13) holds under the assumption that the subspace $\mathcal{F}$ contains the column space of the loss matrix, $\mathrm{span}(L)$, which also means that the set $\mathcal{F}$ contains the optimal prediction for any $q$ (see Lemma 9 in App. B for the proof). Importantly, the function $\delta\phi_{\mathrm{quad}}(F\theta, q)$ is jointly convex in the conditional probability $q$ and parameters $\theta$, which simplifies its analysis.
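The closed form of the excess surrogate risk can be checked numerically in the unconstrained case. The sketch below assumes the quadratic surrogate $\frac{1}{2k}\|f + L(:, y)\|_2^2$ and unconstrained scores, for which the conditional surrogate risk is minimized at $f^* = -Lq$; all function names are ours.

```python
def quad_surrogate(f, L, y):
    """Quadratic surrogate (1 / 2k) * ||f + L[:, y]||^2 (assumed form)."""
    k = len(f)
    return sum((f[c] + L[c][y]) ** 2 for c in range(k)) / (2 * k)


def cond_surrogate_risk(f, L, q):
    """Conditional surrogate risk: expectation of the surrogate under q."""
    return sum(q[y] * quad_surrogate(f, L, y) for y in range(len(q)))


def loss_times_q(L, q):
    """The vector L q of expected losses for each prediction."""
    k = len(q)
    return [sum(L[c][y] * q[y] for y in range(k)) for c in range(k)]


def excess_closed_form(f, L, q):
    """(1 / 2k) * ||f + L q||^2."""
    Lq = loss_times_q(L, q)
    return sum((fc + lc) ** 2 for fc, lc in zip(f, Lq)) / (2 * len(f))


def excess_direct(f, L, q):
    """Surrogate risk at f minus its value at the unconstrained minimizer -L q."""
    f_star = [-v for v in loss_times_q(L, q)]
    return cond_surrogate_risk(f, L, q) - cond_surrogate_risk(f_star, L, q)


if __name__ == "__main__":
    L01 = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
    q, f = [0.2, 0.3, 0.5], [1.0, -2.0, 0.5]
    print(excess_direct(f, L01, q), excess_closed_form(f, L01, q))
```

The two quantities agree because the conditional surrogate risk is a quadratic in $f$ with Hessian $\frac{1}{k} I$, so its excess is a squared distance to the minimizer.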
Lower bound on the calibration function. We now present our main technical result: a lower bound on the calibration function for the surrogate loss (12). This lower bound characterizes the easiness of learning with this surrogate given the scaling intuition mentioned in Section 3.3. The proof of Theorem 7 is given in App. D.1.
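Theorem 7 (Lower bound on $H_{\Phi_{\mathrm{quad}}}$).
For any task loss $L$, its quadratic surrogate $\Phi_{\mathrm{quad}}$, and a score subspace $\mathcal{F}$ containing the column space of $L$, the calibration function can be lower bounded:
$$H_{\Phi_{\mathrm{quad}}, L, \mathcal{F}}(\varepsilon) \geq \frac{\varepsilon^2}{2k \max_{i \neq j} \|P_{\mathcal{F}} \Delta_{ij}\|_2^2} \geq \frac{\varepsilon^2}{4k}, \tag{14}$$
where $P_{\mathcal{F}}$ is the orthogonal projection on the subspace $\mathcal{F}$ and $\Delta_{ij} = e_i - e_j \in \mathbb{R}^k$ with $e_c$ being the $c$-th basis vector of the standard basis in $\mathbb{R}^k$.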
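For small problems, a projection-based bound of this kind can be evaluated by brute force. The sketch below assumes the bound has the form $\varepsilon^2 / (2k \max_{i \neq j} \|P_{\mathcal{F}}(e_i - e_j)\|_2^2)$ and returns the coefficient in front of $\varepsilon^2$. It builds an orthonormal basis of the column space of a matrix `F` by modified Gram-Schmidt; function names and the tolerance are ours.

```python
import math


def orthonormal_basis(F, tol=1e-9):
    """Modified Gram-Schmidt on the columns of F (given as a list of rows)."""
    k, r = len(F), len(F[0])
    basis = []
    for j in range(r):
        v = [F[i][j] for i in range(k)]
        for u in basis:
            dot = sum(a * b for a, b in zip(u, v))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > tol:  # skip columns that are linearly dependent
            basis.append([a / norm for a in v])
    return basis


def lower_bound_coefficient(F):
    """Coefficient c such that the assumed bound reads H(eps) >= c * eps**2."""
    k = len(F)
    basis = orthonormal_basis(F)
    worst = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            # ||P (e_i - e_j)||^2 = sum over basis vectors u of (u_i - u_j)^2
            worst = max(worst, sum((u[i] - u[j]) ** 2 for u in basis))
    return 1.0 / (2 * k * worst)


if __name__ == "__main__":
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    blocks = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # b = 2 blocks
    print(lower_bound_coefficient(identity), lower_bound_coefficient(blocks))
```

With unconstrained scores on four labels the coefficient is $1/(4k) = 1/16$, while restricting the scores to be constant within two blocks gives $1/(4b) = 1/8$; the enumeration over all pairs of labels is of course only feasible for toy label sets.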
Lower bound for specific losses. We now discuss the meaning of the bound (14) for some specific losses (the detailed derivations are given in App. D.3). For the 0-1, block 0-1 and Hamming losses ($L_{01}$, $L_{01,b}$ and $L_{\mathrm{Ham},T}$, respectively) with the smallest possible score subspaces $\mathcal{F}$, the bound (14) gives $\frac{\varepsilon^2}{4k}$, $\frac{\varepsilon^2}{4b}$ and $\frac{\varepsilon^2}{8T}$, respectively. All these bounds are tight (see App. E). However, if $\mathcal{F} = \mathbb{R}^k$, the bound (14) is not tight for the block 0-1 and mixed losses (see also App. E). In particular, the bound (14) cannot detect level-$\eta$ consistency for $\eta > 0$ (see Def. 3) and does not change when the loss changes, but the score subspace stays the same.
Upper bound on the calibration function. Theorem 8 below gives an upper bound on the calibration function holding for unconstrained scores, i.e., $\mathcal{F} = \mathbb{R}^k$ (see the proof in App. D.2). This result shows that without some appropriate constraints on the scores, efficient learning is not guaranteed (in the worst case) because of the $1/k$ scaling of the calibration function.
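Theorem 8 (Upper bound on $H_{\Phi_{\mathrm{quad}}}$).
If a loss matrix $L$ with $L_{\max} > 0$ defines a pseudometric on labels and there are no constraints on the scores, i.e., $\mathcal{F} = \mathbb{R}^k$, then the calibration function for the quadratic surrogate $\Phi_{\mathrm{quad}}$ can be upper bounded:
$$H_{\Phi_{\mathrm{quad}}, L, \mathbb{R}^k}(\varepsilon) \leq \frac{\varepsilon^2}{2k}, \quad 0 \leq \varepsilon \leq L_{\max}.$$
(A pseudometric is a function $d(y_1, y_2)$ satisfying the following axioms: $d(y_1, y_2) \geq 0$; $d(y_1, y_1) = 0$, but possibly $d(y_1, y_2) = 0$ for some $y_1 \neq y_2$; $d(y_1, y_2) = d(y_2, y_1)$; and $d(y_1, y_3) \leq d(y_1, y_2) + d(y_2, y_3)$.)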
From our lower bound in Theorem 7 (which guarantees consistency), the natural constraint on the score is $\mathcal{F} = \mathrm{span}(L)$, with the dimension of this space giving an indication of the intrinsic “difficulty” of a loss. Computations for the lower bounds in some specific cases (see App. D.3 for details) show that the 0-1 loss is “hard” while the block 0-1 loss and the Hamming loss are “easy”. Note that in all these cases the lower bound (14) is tight, see the discussion below.
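Exact calibration functions. The bounds of Theorems 7 and 8 together imply that, when there are no constraints on the scores ($\mathcal{F} = \mathbb{R}^k$), the calibration function for the 0-1, block 0-1 and Hamming losses satisfies
$$\frac{\varepsilon^2}{4k} \leq H_{\Phi_{\mathrm{quad}}, L, \mathbb{R}^k}(\varepsilon) \leq \frac{\varepsilon^2}{2k}, \quad 0 \leq \varepsilon \leq L_{\max},$$
where $L$ is the matrix defining a loss. For completeness, in App. E, we compute the exact calibration functions for the 0-1 and block 0-1 losses. Note that the calibration function for the 0-1 loss equals the lower bound, illustrating the worst-case scenario. To get some intuition, an example of a conditional distribution $q$ that gives the (worst case) value to the calibration function (for several losses) is $q_i = \frac{1}{2} + \frac{\varepsilon}{2}$, $q_j = \frac{1}{2} - \frac{\varepsilon}{2}$ and $q_c = 0$ for $c \notin \{i, j\}$. See the proof of Proposition 12 in App. E.1.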
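In what follows, we provide the calibration functions in the cases with constraints on the scores. For the block 0-1 loss with $b$ equal blocks and under constraints that the scores within blocks are equal, the calibration function equals (see Proposition 14 of App. E.2)
$$H_{\Phi_{\mathrm{quad}}, L_{01,b}, \mathcal{F}_{01,b}}(\varepsilon) = \frac{\varepsilon^2}{4b}, \quad 0 \leq \varepsilon \leq 1. \tag{16}$$
For the Hamming loss defined over $T$ binary variables and under the tightest constraints on the scores implying consistency, $\mathcal{F}_{\mathrm{Ham},T} = \mathrm{span}(L_{\mathrm{Ham},T})$, the calibration function equals (see App. E)
$$H_{\Phi_{\mathrm{quad}}, L_{\mathrm{Ham},T}, \mathcal{F}_{\mathrm{Ham},T}}(\varepsilon) = \frac{\varepsilon^2}{8T}, \quad 0 \leq \varepsilon \leq 1. \tag{17}$$
The calibration functions (16) and (17) depend on the quantities representing the actual complexities of the loss (the number of blocks $b$ and the length of the sequence $T$) and can be exponentially larger than the upper bound for the unconstrained case.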
In the case of mixed 0-1 and block 0-1 loss, if the scores $f$ are constrained to be equal inside the blocks, i.e., belong to the subspace $\mathcal{F}_{01,b} = \mathrm{span}(L_{01,b}) \subsetneq \mathbb{R}^k$, then the calibration function is equal to $0$ for $\varepsilon \leq \frac{\eta}{2}$, implying inconsistency (and also note that the approximation error can be as big as $\eta$ for $\mathcal{F}_{01,b}$). However, for $\varepsilon > \frac{\eta}{2}$, the calibration function is of the order $\frac{1}{b}\big(\varepsilon - \frac{\eta}{2}\big)^2$. See Figure 1b for the illustration of this calibration function and Proposition 17 of App. E.4 for the exact formulation and the proof. Note that while the surrogate in the constrained case is inconsistent, the value of its calibration function can be exponentially larger than the one for the unconstrained case for $\varepsilon$ big enough and when the blocks are exponentially large (see Proposition 16 of App. E.4).
Computation of the SGD constants. Applying the learning complexity Theorem 6 requires computing the quantity $DM$, where $D$ bounds the norm of the optimal solution and $M$ bounds the expected square of the norm of the stochastic gradient. In App. F, we provide a way to bound this quantity for our quadratic surrogate (12) under the simplifying assumption that each conditional $q_c(x)$ (seen as a function of $x$) belongs to the RKHS $\mathcal{H}$ (which implies Assumption 4). In particular, we get
$$DM = L_{\max}^2\, \xi\big(\kappa(F) \sqrt{r}\, R\, Q_{\max}\big), \quad \xi(z) = z^2 + z,$$
where $\kappa(F)$ is the condition number of the matrix $F$ and $R$ is an upper bound on the RKHS norm of object feature maps $\|\psi(x)\|_{\mathcal{H}}$. We define $Q_{\max}$ as an upper bound on $\sum_{c=1}^{k} \|q_c\|_{\mathcal{H}}$ (can be seen as the generalization of the inequality $\sum_{c=1}^{k} q_c \leq 1$ for probabilities). The constants $R$ and $Q_{\max}$ depend on the data, the constant $L_{\max}$ depends on the loss, and $r$ and $\kappa(F)$ depend on the choice of matrix $F$.
We compute the constant $DM$ for the specific losses that we considered in App. F.1. For the 0-1, block 0-1 and Hamming losses, we have $DM = O(k)$, $DM = O(b)$ and $DM = O(\log_2^3 k)$, respectively. These computations indicate that the quadratic surrogate allows efficient learning for structured block 0-1 and Hamming losses, but that the convergence could be slow in the worst case for the 0-1 loss.
5 Related works
Consistency for multi-class problems. Building on significant progress for the case of binary classification (see, e.g., Bartlett et al., 2006), there has been a lot of interest in the multi-class case. Zhang (2004a) and Tewari & Bartlett (2007) analyze the consistency of many existing surrogates for the 0-1 loss. Gao & Zhou (2011) focus on multi-label classification. Narasimhan et al. (2015) provide a consistent algorithm for arbitrary multi-class loss defined by a function of the confusion matrix. Recently, Ramaswamy & Agarwal (2016) introduced the notion of convex calibrated dimension, as the minimal dimensionality of the score vector that is required for consistency. In particular, they showed that for the Hamming loss on $T$ binary variables, this dimension is at most $T$. In our analysis, we use scores of rank $(T + 1)$, see (35) in App. D.3, yielding a similar result.
The task of ranking has attracted a lot of attention and Duchi et al. (2010); Buffoni et al. (2011); Calauzènes et al. (2012); Ramaswamy et al. (2013) analyze different families of surrogate and task losses proving their (in-)consistency. In this line of work, Ramaswamy et al. (2013) propose a quadratic surrogate for an arbitrary low rank loss which is related to our quadratic surrogate (12). They also prove that several important ranking losses, i.e., precision@q, expected rank utility, mean average precision and pairwise disagreement, are of low-rank. We conjecture that our approach is compatible with these losses and leave precise connections as future work.
Structured SVM (SSVM) and friends. SSVM (Taskar et al., 2003, 2005; Tsochantaridis et al., 2005) is one of the most used convex surrogates for tasks with structured outputs; thus, its consistency has been a question of great interest. It is known that the Crammer-Singer multi-class SVM (Crammer & Singer, 2001), which SSVM is built on, is not consistent for the 0-1 loss unless there is a majority class with probability at least $\frac{1}{2}$ (Zhang, 2004a; McAllester, 2007). However, it is consistent for the “abstain” and ordinal losses in the case of $3$ classes (Ramaswamy & Agarwal, 2016). Structured ramp loss and probit surrogates are closely related to SSVM and are consistent (McAllester, 2007; Do et al., 2009; McAllester & Keshet, 2011; Keshet, 2014), but not convex.
Recently, Doğan et al. (2016) categorized different versions of multi-class SVM and analyzed them from the points of view of Fisher and universal consistency. In particular, they highlight differences between Fisher and universal consistency and give examples of surrogates that are Fisher consistent, but not universally consistent, and vice versa. They also highlight that the Crammer-Singer SVM is neither Fisher nor universally consistent, even with a careful choice of regularizer.
Quadratic surrogates for structured prediction. Ciliberto et al. (2016) and Brouard et al. (2016) consider minimizing $\|g(x) - \psi_o(y)\|_{\mathcal{H}}^2$, aiming to match the RKHS embedding of inputs $g : \mathcal{X} \to \mathcal{H}$ to the feature maps of outputs $\psi_o : \mathcal{Y} \to \mathcal{H}$. In their frameworks, the task loss is not considered at the learning stage, but only at the prediction stage. Our quadratic surrogate (12) depends on the loss directly. The empirical risk defined by both their and our objectives can be minimized analytically with the help of the kernel trick and, moreover, the resulting predictors are identical. However, performing such computation in the case of a large dataset can be intractable and the generalization properties have to be taken care of, e.g., by means of regularization. In the large-scale scenario, it is more natural to apply stochastic optimization (e.g., kernel ASGD) that directly minimizes the population risk and has better dependency on the dataset size. When combined with stochastic optimization, the two approaches lead to different behavior. In our framework, we need to estimate $r$ scalar functions, but the alternative needs to estimate $k$ functions (if, e.g., $\psi_o(y) = e_y \in \mathbb{R}^k$), which results in significant differences for low-rank losses, such as block 0-1 and Hamming.
Calibration functions. Bartlett et al. (2006) and Steinwart (2007) provide calibration functions for most existing surrogates for binary classification. All these functions differ in terms of shape, but are roughly similar in terms of constants. Pedregosa et al. (2017) generalize these results to the case of ordinal regression. However, their calibration functions have at best a $1/k$ factor if the surrogate is normalized w.r.t. the number of classes. The task of ranking has been of significant interest. However, most of the literature (e.g., Clémençon et al., 2008; Cossock & Zhang, 2008; Kotlowski et al., 2011; Agarwal, 2014) only focuses on calibration functions (in the form of regret bounds) for bipartite ranking, which is more akin to cost-sensitive binary classification.
Ávila Pires et al. (2013) generalize the theoretical framework developed by Steinwart (2007) and present results for the multi-class SVM of Lee et al. (2004) (the score vectors are constrained to sum to zero) that can be built for any task loss of interest. Their surrogate is of the form $\Phi(f, y) = \sum_{c=1}^{k} L(c, y)\, a(f_c)$, where $\sum_{c=1}^{k} f_c = 0$ and $a(f)$ is some convex function with all subgradients at zero being positive. The recent work by Ávila Pires & Szepesvári (2016) refines the results, but specifically for the case of the 0-1 loss. In this line of work, the surrogate is typically not normalized by $k$, and if normalized, the calibration functions have the constant $\frac{1}{k}$ appearing.
Finally, Ciliberto et al. (2016) provide the calibration function for their quadratic surrogate. They assume that the loss can be represented as $L(\hat{y}, y) = \langle \psi(\hat{y}), V \psi(y) \rangle_{\mathcal{H}_Y}$ (this assumption can always be satisfied in the case of a finite number of labels, by taking $V$ as the loss matrix $L$ and $\psi(y) := e_y \in \mathbb{R}^k$, where $e_y$ is the $y$-th vector of the standard basis in $\mathbb{R}^k$). In their Theorem 2, they provide an excess risk bound leading to a lower bound on the corresponding calibration function, where a constant $c_\Delta$ simply equals the spectral norm of the loss matrix for the finite-dimensional construction provided above. However, the spectral norm of the loss matrix is exponentially large even for highly structured losses such as the block 0-1 and Hamming losses, i.e., $\|L_{01,b}\|_2 = k - \frac{k}{b}$, $\|L_{\mathrm{Ham},T}\|_2 = \frac{k}{2}$. This conclusion puts the objective of Ciliberto et al. (2016) in line with ours when no constraints are put on the scores.
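The growth of the spectral norm with the number of labels can be checked numerically. The sketch below is only an illustration under assumed definitions: the 0-1 loss matrix $\mathbf{1}\mathbf{1}^\top - I$, a block 0-1 loss with $b$ equal-size blocks, and the Hamming loss normalized by the sequence length $T$ over binary sequences. All helper names are hypothetical.

```python
import itertools
import numpy as np

def zero_one_loss_matrix(k):
    """0-1 loss: zero on the diagonal, one elsewhere."""
    return np.ones((k, k)) - np.eye(k)

def block_zero_one_loss_matrix(k, b):
    """Loss 0 inside each of b equal-size blocks of labels, 1 across blocks."""
    if k % b != 0:
        raise ValueError("k must be divisible by b")
    block_id = np.arange(k) // (k // b)
    return (block_id[:, None] != block_id[None, :]).astype(float)

def hamming_loss_matrix(T):
    """Normalized Hamming loss over all binary sequences of length T."""
    labels = np.array(list(itertools.product([0, 1], repeat=T)))
    return (labels[:, None, :] != labels[None, :, :]).mean(axis=2)

def spectral_norm(L):
    """Largest singular value of the loss matrix."""
    return np.linalg.norm(L, 2)

for T in (3, 4, 5):
    k = 2 ** T
    print(
        f"k={k}: "
        f"0-1 -> {spectral_norm(zero_one_loss_matrix(k)):.3f}, "
        f"block 0-1 (b=4) -> {spectral_norm(block_zero_one_loss_matrix(k, 4)):.3f}, "
        f"Hamming -> {spectral_norm(hamming_loss_matrix(T)):.3f}"
    )
```

With these definitions all three norms scale linearly with $k$, and therefore exponentially with $T$.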
In this paper, we studied the consistency of convex surrogate losses specifically in the context of structured prediction. We analyzed calibration functions and proposed an optimization-based normalization aiming to connect consistency with the existence of efficient learning algorithms. Finally, we instantiated all components of our framework for several losses by computing the calibration functions and the constants coming from the normalization. By carefully monitoring exponential constants, we highlighted the difference between tractable and intractable task losses.
These were first steps in advancing our theoretical understanding of consistent structured prediction. Further steps include analyzing more losses, such as the low-rank ranking losses studied by Ramaswamy et al. (2013). In addition, instead of considering constraints on the scores, one could put constraints on the set of distributions to investigate the effect on the calibration function.
We would like to thank Pascal Germain for useful discussions. This work was partly supported by the ERC grant Activia (no. 307574), the NSERC Discovery Grant RGPIN-2017-06936 and the MSR-INRIA Joint Center.
- Agarwal (2014) Agarwal, Shivani. Surrogate regret bounds for bipartite ranking via strongly proper losses. Journal of Machine Learning Research (JMLR), 15(1):1653–1674, 2014.
- Ávila Pires & Szepesvári (2016) Ávila Pires, Bernardo and Szepesvári, Csaba. Multiclass classification calibration functions. arXiv, 1609.06385v1, 2016.
- Ávila Pires et al. (2013) Ávila Pires, Bernardo, Ghavamzadeh, Mohammad, and Szepesvári, Csaba. Cost-sensitive multiclass classification risk bounds. In ICML, 2013.
- Bakir et al. (2007) Bakir, Gökhan, Hofmann, Thomas, Schölkopf, Bernhard, Smola, Alexander J., Taskar, Ben, and Vishwanathan, S.V.N. Predicting Structured Data. MIT Press, 2007.
- Bartlett et al. (2006) Bartlett, Peter L., Jordan, Michael I., and McAuliffe, Jon D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
- Bousquet & Bottou (2008) Bousquet, Olivier and Bottou, Léon. The tradeoffs of large scale learning. In NIPS, 2008.
- Brouard et al. (2016) Brouard, Céline, Szafranski, Marie, and d’Alché-Buc, Florence. Input output kernel regression: Supervised and semi-supervised structured output prediction with operator-valued kernels. Journal of Machine Learning Research (JMLR), 17(176):1–48, 2016.
- Buffoni et al. (2011) Buffoni, David, Gallinari, Patrick, Usunier, Nicolas, and Calauzènes, Clément. Learning scoring functions with order-preserving losses and standardized supervision. In ICML, 2011.
- Calauzènes et al. (2012) Calauzènes, Clément, Usunier, Nicolas, and Gallinari, Patrick. On the (non-)existence of convex, calibrated surrogate losses for ranking. In NIPS, 2012.
- Ciliberto et al. (2016) Ciliberto, Carlo, Rosasco, Lorenzo, and Rudi, Alessandro. A consistent regularization approach for structured prediction. In NIPS, 2016.
- Clémençon et al. (2008) Clémençon, Stéphan, Lugosi, Gábor, and Vayatis, Nicolas. Ranking and empirical minimization of U-statistics. The Annals of Statistics, pp. 844–874, 2008.
- Collins (2002) Collins, Michael. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP, 2002.
- Cortes et al. (2016) Cortes, Corinna, Kuznetsov, Vitaly, Mohri, Mehryar, and Yang, Scott. Structured prediction theory based on factor graph complexity. In NIPS, 2016.
- Cossock & Zhang (2008) Cossock, David and Zhang, Tong. Statistical analysis of Bayes optimal subset ranking. IEEE Transactions on Information Theory, 54(11):5140–5154, 2008.
- Crammer & Singer (2001) Crammer, Koby and Singer, Yoram. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research (JMLR), 2:265–292, 2001.
- Do et al. (2009) Do, Chuong B., Le, Quoc, Teo, Choon Hui, Chapelle, Olivier, and Smola, Alex. Tighter bounds for structured estimation. In NIPS, 2009.
- Doğan et al. (2016) Doğan, Ürün, Glasmachers, Tobias, and Igel, Christian. A unified view on multi-class support vector classification. Journal of Machine Learning Research (JMLR), 17(45):1–32, 2016.
- Duchi et al. (2010) Duchi, John C., Mackey, Lester W., and Jordan, Michael I. On the consistency of ranking algorithms. In ICML, 2010.
- Durbin et al. (1998) Durbin, Richard, Eddy, Sean, Krogh, Anders, and Mitchison, Graeme. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, 1998.
- Gao & Zhou (2011) Gao, Wei and Zhou, Zhi-Hua. On the consistency of multi-label learning. In COLT, 2011.
- Gimpel & Smith (2010) Gimpel, Kevin and Smith, Noah A. Softmax-margin CRFs: Training loglinear models with cost functions. In NAACL, 2010.
- Hazan & Urtasun (2010) Hazan, Tamir and Urtasun, Raquel. A primal-dual message-passing algorithm for approximated large scale structured prediction. In NIPS, 2010.
- Keshet (2014) Keshet, Joseph. Optimizing the measure of performance in structured prediction. In Advanced Structured Prediction. MIT Press, 2014.
- Kotlowski et al. (2011) Kotlowski, Wojciech, Dembczynski, Krzysztof, and Huellermeier, Eyke. Bipartite ranking through minimization of univariate loss. In ICML, 2011.
- Lafferty et al. (2001) Lafferty, John, McCallum, Andrew, and Pereira, Fernando. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
- Lee et al. (2004) Lee, Yoonkyung, Lin, Yi, and Wahba, Grace. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99(465):67–81, 2004.
- Lin (2004) Lin, Yi. A note on margin-based loss functions in classification. Statistics & Probability Letters, 68(1):73–82, 2004.
- London et al. (2016) London, Ben, Huang, Bert, and Getoor, Lise. Stability and generalization in structured prediction. Journal of Machine Learning Research (JMLR), 17(222):1–52, 2016.
- Long & Servedio (2013) Long, Phil and Servedio, Rocco. Consistency versus realizable H-consistency for multiclass classification. In ICML, 2013.
- McAllester & Keshet (2011) McAllester, D. A. and Keshet, J. Generalization bounds and consistency for latent structural probit and ramp loss. In NIPS, 2011.
- McAllester (2007) McAllester, David. Generalization bounds and consistency for structured labeling. In Predicting Structured Data. MIT Press, 2007.
- Narasimhan et al. (2015) Narasimhan, Harikrishna, Ramaswamy, Harish G., Saha, Aadirupa, and Agarwal, Shivani. Consistent multiclass algorithms for complex performance measures. In ICML, 2015.
- Nemirovski et al. (2009) Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
- Nowozin & Lampert (2011) Nowozin, Sebastian and Lampert, Christoph H. Structured learning and prediction in computer vision. Foundations and Trends in Computer Graphics and Vision, 6(3–4):185–365, 2011.
- Nowozin et al. (2014) Nowozin, Sebastian, Gehler, Peter V., Jancsary, Jeremy, and Lampert, Christoph H. Advanced Structured Prediction. MIT Press, 2014.
- Orabona (2014) Orabona, Francesco. Simultaneous model selection and optimization through parameter-free stochastic learning. In NIPS, 2014.
- Pedregosa et al. (2017) Pedregosa, Fabian, Bach, Francis, and Gramfort, Alexandre. On the consistency of ordinal regression methods. Journal of Machine Learning Research (JMLR), 18(55):1–35, 2017.
- Pletscher et al. (2010) Pletscher, Patrick, Ong, Cheng Soon, and Buhmann, Joachim M. Entropy and margin maximization for structured output learning. In ECML PKDD, 2010.
- Ramaswamy & Agarwal (2016) Ramaswamy, Harish G. and Agarwal, Shivani. Convex calibration dimension for multiclass loss matrices. Journal of Machine Learning Research (JMLR), 17(14):1–45, 2016.
- Ramaswamy et al. (2013) Ramaswamy, Harish G., Agarwal, Shivani, and Tewari, Ambuj. Convex calibrated surrogates for low-rank loss matrices with applications to subset ranking losses. In NIPS, 2013.
- Shi et al. (2015) Shi, Qinfeng, Reid, Mark, Caetano, Tiberio, van den Hengel, Anton, and Wang, Zhenhua. A hybrid loss for multiclass and structured prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 37(1):2–12, 2015.
- Smith (2011) Smith, Noah A. Linguistic structure prediction. Synthesis Lectures on Human Language Technologies, 4(2):1–274, 2011.
- Steinwart (2007) Steinwart, Ingo. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225–287, 2007.
- Taskar et al. (2003) Taskar, Ben, Guestrin, Carlos, and Koller, Daphne. Max-margin Markov networks. In NIPS, 2003.
- Taskar et al. (2005) Taskar, Ben, Chatalbashev, Vassil, Koller, Daphne, and Guestrin, Carlos. Learning structured prediction models: a large margin approach. In ICML, 2005.
- Tewari & Bartlett (2007) Tewari, Ambuj and Bartlett, Peter L. On the consistency of multiclass classification methods. Journal of Machine Learning Research (JMLR), 8:1007–1025, 2007.
- Tsochantaridis et al. (2005) Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research (JMLR), 6:1453–1484, 2005.
- Williamson et al. (2016) Williamson, Robert C., Vernet, Elodie, and Reid, Mark D. Composite multiclass losses. Journal of Machine Learning Research (JMLR), 17(223):1–52, 2016.
- Zhang (2004a) Zhang, Tong. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research (JMLR), 5:1225–1251, 2004a.
- Zhang (2004b) Zhang, Tong. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1):56–134, 2004b.
Supplementary Material (Appendix)
On Structured Prediction Theory with Calibrated Convex Surrogate Losses
- Section A: Proof of learning complexity Theorem 6.
- Section B: Technical lemmas useful for the proofs.
- Section C: Discussion and consistency results on a family of surrogate losses.
- Section D:
- Section E:
- Section F: Computing constants appearing in the SGD rate.
- Section G: Properties of the basis of the Hamming loss.
Appendix A Learning complexity theorem
Theorem 6 (Learning complexity).
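Under the assumptions of Theorem 5, for any $\varepsilon > 0$, the random (w.r.t. the observed training set) output $f_{\bar{w}^{(N)}}$ of the ASGD algorithm after
$$N > N^* := \frac{4 D^2 M^2}{\check{H}_{\Phi, L, \mathcal{F}}^2(\varepsilon)}$$
iterations has the expected excess risk bounded with $\varepsilon$, i.e.,
$$\mathbb{E}\big[\mathcal{R}_L(f_{\bar{w}^{(N)}})\big] < \mathcal{R}^*_{L, \mathcal{F}} + \varepsilon.$$
Proof. By (10) from Theorem 5, $N$ steps of the algorithm, in expectation, result in $\check{H}_{\Phi, L, \mathcal{F}}(\varepsilon)$ accuracy on the surrogate risk, i.e., $\mathbb{E}[\mathcal{R}_\Phi(f_{\bar{w}^{(N)}})] - \mathcal{R}^*_{\Phi, \mathcal{F}} < \check{H}_{\Phi, L, \mathcal{F}}(\varepsilon)$. We now generalize the proof of Theorem 2 to the case of expectation w.r.t. $\bar{w}^{(N)}$ depending on the random samples used by the ASGD algorithm. We take the expectation of (4) w.r.t. $\bar{w}^{(N)}$ substituted as $f$ and use Jensen’s inequality (by convexity of $\check{H}_{\Phi, L, \mathcal{F}}$) to get
$$\check{H}_{\Phi, L, \mathcal{F}}\big(\mathbb{E}[\mathcal{R}_L(f_{\bar{w}^{(N)}})] - \mathcal{R}^*_{L, \mathcal{F}}\big) \le \mathbb{E}\big[\check{H}_{\Phi, L, \mathcal{F}}\big(\mathcal{R}_L(f_{\bar{w}^{(N)}}) - \mathcal{R}^*_{L, \mathcal{F}}\big)\big] \le \mathbb{E}[\mathcal{R}_\Phi(f_{\bar{w}^{(N)}})] - \mathcal{R}^*_{\Phi, \mathcal{F}} < \check{H}_{\Phi, L, \mathcal{F}}(\varepsilon).$$
Finally, monotonicity of $\check{H}_{\Phi, L, \mathcal{F}}$ implies $\mathbb{E}[\mathcal{R}_L(f_{\bar{w}^{(N)}})] - \mathcal{R}^*_{L, \mathcal{F}} < \varepsilon$. ∎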
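To get a feel for how the required number of iterations reacts to the calibration function, the following small sketch evaluates a threshold of the form $N^* = 4 D^2 M^2 / h^2$, where $h$ stands for the value of the (convexified) calibration function at the target accuracy. This form assumes that the ASGD rate of Theorem 5 reads $2DM/\sqrt{N}$; the function name and all numeric inputs are hypothetical and chosen only for illustration.

```python
def asgd_iteration_threshold(D, M, h_eps):
    """Threshold N* = 4 D^2 M^2 / h_eps^2 on the number of ASGD iterations.

    D, M  : constants of the assumed rate 2*D*M/sqrt(N) on the surrogate risk.
    h_eps : value of the calibration function at the target accuracy epsilon.
    """
    if h_eps <= 0:
        raise ValueError("the calibration function value must be positive")
    return 4.0 * D ** 2 * M ** 2 / h_eps ** 2

# Hypothetical numbers: shrinking the calibration value by a factor of 2
# multiplies the threshold by a factor of 4.
for h in (0.5, 0.25, 0.125):
    print(f"h={h}: N* = {asgd_iteration_threshold(1.0, 1.0, h)}")
```

The quadratic dependence on $1/h$ is why a calibration function that is exponentially small in the output size translates into an exponentially large number of iterations.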
Appendix B Technical lemmas
In this section, we prove two technical lemmas that simplify the proofs of the main theoretical claims of the paper.
Lemma 9 computes the excess of the weighted surrogate risk for the quadratic loss (12), which is central to our analysis presented in Section 4. The key property of this result is that the excess is jointly convex w.r.t. the parameters $\theta$ and the conditional distribution $q$, which simplifies further analysis.
Lemma 10 allows us to cope with the combinatorial aspect of the computation of the calibration function. In particular, when the excess of the weighted surrogate risk is convex, Lemma 10 reduces the computation of the calibration function to a set of convex optimization problems, which often can be solved analytically. For symmetric losses, such as the 0-1, block 0-1 and Hamming losses, Lemma 10 also provides “symmetry breaking”, meaning that many of the obtained convex optimization problems are identical up to a permutation of labels.
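Lemma 9. Consider the quadratic surrogate (12) defined for a task loss $L$. Let a subspace of scores $\mathcal{F} \subseteq \mathbb{R}^k$ be parametrized by $\theta \in \mathbb{R}^r$, i.e., $f = F\theta \in \mathcal{F}$ with $F \in \mathbb{R}^{k \times r}$, and assume that $\operatorname{span}(L) \subseteq \mathcal{F}$. Then, the excess of the weighted surrogate loss can be expressed as
$$\delta\phi_{\mathrm{quad}}(F\theta, q) := \phi_{\mathrm{quad}}(F\theta, q) - \inf_{\theta' \in \mathbb{R}^r} \phi_{\mathrm{quad}}(F\theta', q) = \frac{1}{2k} \|F\theta + Lq\|_2^2.$$
Proof. By using the definition of the quadratic surrogate (12), we have
$$\phi_{\mathrm{quad}}(F\theta, q) = \frac{1}{2k}\big(\theta^\top F^\top F \theta + 2\, \theta^\top F^\top L q\big) + C(q),$$
$$\theta^* := \operatorname*{argmin}_{\theta} \phi_{\mathrm{quad}}(F\theta, q) = -(F^\top F)^{\dagger} F^\top L q,$$
$$\delta\phi_{\mathrm{quad}}(F\theta, q) = \frac{1}{2k}\big(\theta^\top F^\top F \theta + 2\, \theta^\top F^\top L q + q^\top L^\top F (F^\top F)^{\dagger} F^\top L q\big),$$
where $C(q)$ denotes the quantity independent of parameters $\theta$. Note that $P_F := F (F^\top F)^{\dagger} F^\top$ is the orthogonal projection on the subspace $\operatorname{span}(F)$, so if $\operatorname{span}(L) \subseteq \operatorname{span}(F)$ we have $P_F L = L$, which finishes the proof. ∎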
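The identity of this lemma can be sanity-checked numerically. The sketch below assumes that the quadratic surrogate (12) has the form $\frac{1}{2k}\|f + L(:, y)\|_2^2$ and compares the excess surrogate risk, computed by explicit minimization over $\theta$, with the closed form $\frac{1}{2k}\|F\theta + Lq\|_2^2$. The random matrices are hypothetical test data (the matrix `L` is not a genuine loss matrix, it is only constructed so that its columns lie in the span of `F`).

```python
import numpy as np

rng = np.random.default_rng(0)
k, r = 6, 4

# Hypothetical setup: a score subspace span(F) that contains span(L).
F = rng.standard_normal((k, r))
L = F @ rng.standard_normal((r, k))   # columns of L lie in span(F) by construction
q = rng.dirichlet(np.ones(k))         # conditional distribution over labels
theta = rng.standard_normal(r)        # arbitrary parameters

def surrogate_risk(f, q, L):
    """Conditional risk sum_y q_y * ||f + L[:, y]||^2 / (2k) (assumed form of (12))."""
    k = L.shape[0]
    return sum(q[y] * np.sum((f + L[:, y]) ** 2) for y in range(k)) / (2 * k)

# The risk is a least-squares objective in theta, minimized via the pseudo-inverse.
theta_star = -np.linalg.pinv(F) @ (L @ q)

excess = surrogate_risk(F @ theta, q, L) - surrogate_risk(F @ theta_star, q, L)
closed_form = np.sum((F @ theta + L @ q) ** 2) / (2 * k)

print(np.isclose(excess, closed_form))
```

If the columns of `L` are not contained in the span of `F`, the two quantities differ in general, which is why the span assumption is needed.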
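Lemma 10. In the case of a finite number $k$ of labels, for any task loss $L$, a surrogate loss $\Phi$ that is continuous and bounded from below, and a set of scores $\mathcal{F}$, the calibration function can be written as
$$H_{\Phi, L, \mathcal{F}}(\varepsilon) = \min_{\substack{i, j \in \mathrm{pred}(\mathcal{F}) \\ i \neq j}} H_{ij}(\varepsilon),$$
where the set $\mathrm{pred}(\mathcal{F})$ is defined as the set of labels that the predictor can predict for some feasible scores and $H_{ij}$ is defined via minimization of the same objective as (5), but w.r.t. a smaller domain:
$$\begin{aligned} H_{ij}(\varepsilon) = \inf_{f, q}\ & \delta\phi(f, q) \\ \text{s.t.}\ & \ell_i(q) \le \ell_j(q) - \varepsilon, \\ & \ell_i(q) \le \ell_c(q),\ \forall c \in \mathrm{pred}(\mathcal{F}), \\ & f_j \ge f_c,\ \forall c \in \mathrm{pred}(\mathcal{F}), \\ & f \in \mathcal{F},\ q \in \Delta_k. \end{aligned} \tag{21}$$
Here $\ell_c(q) := (Lq)_c$ is the expected loss if predicting label $c$. Index $i$ represents a label with the smallest expected loss while index $j$ represents a label with the largest score.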
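Proof. We use the notation $\mathcal{F}_j$ to define the set of score vectors $f$ where the predictor $\mathrm{pred}(f)$ takes a value $j$, i.e., $\mathcal{F}_j := \{f \in \mathcal{F} \mid \mathrm{pred}(f) = j\}$. The union of the sets $\mathcal{F}_j$, $j \in \mathrm{pred}(\mathcal{F})$, equals the whole set $\mathcal{F}$. It is possible that the sets $\mathcal{F}_j$ do not fully contain their boundary because of the usage of a particular tie-breaking strategy, but their closure can be expressed as $\overline{\mathcal{F}}_j := \{f \in \mathcal{F} \mid f_j \ge f_c,\ \forall c \in \mathrm{pred}(\mathcal{F})\}$.
If $f \in \mathcal{F}_j$, i.e., $j = \mathrm{pred}(f)$, then the feasible set of probability vectors $q$ for which a label $i$ is one of the best possible predictions (i.e., $\delta\ell(f, q) = \ell_j(q) - \ell_i(q) \ge \varepsilon$) is
$$\mathcal{Q}_{ij}(\varepsilon) := \{q \in \Delta_k \mid \ell_i(q) \le \ell_c(q),\ \forall c \in \mathrm{pred}(\mathcal{F});\ \ell_j(q) - \ell_i(q) \ge \varepsilon\}.$$
The union of the sets $\mathcal{F}_j \times \mathcal{Q}_{ij}(\varepsilon)$ over $i, j \in \mathrm{pred}(\mathcal{F})$ thus exactly equals the feasibility set of the optimization problem (5)-(6) (note that this is not true for the union of the sets $\overline{\mathcal{F}}_j \times \mathcal{Q}_{ij}(\varepsilon)$, which can be strictly larger), thus we can rewrite the definition of the calibration function as follows:
$$H_{\Phi, L, \mathcal{F}}(\varepsilon) = \min_{\substack{i, j \in \mathrm{pred}(\mathcal{F}) \\ i \neq j}}\ \inf_{f \in \mathcal{F}_j,\ q \in \mathcal{Q}_{ij}(\varepsilon)} \delta\phi(f, q). \tag{22}$$
To finish the proof, we use Lemma 27 of Zhang (2004a) claiming that the function $\delta\phi(f, q)$ is continuous w.r.t. both $q$ and $f$, which allows us to substitute the sets $\mathcal{F}_j$ in (22) with their closures $\overline{\mathcal{F}}_j$ without changing the value of the infimum. ∎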
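As a small worked instance of this reduction, consider binary classification with the 0-1 loss, unconstrained scores, and a quadratic surrogate assumed to have excess risk $\frac{1}{2k}\|f + Lq\|_2^2$. The sketch below approximates one pairwise problem on a grid over $q$. The inner minimization over scores with $f_j \ge f_i$ is done in closed form as a projection onto a half-space. The function name, the grid resolution, and the tolerance are hypothetical choices.

```python
import numpy as np

k = 2
L = np.array([[0.0, 1.0], [1.0, 0.0]])  # binary 0-1 loss

def h_ij(eps, i=0, j=1, grid=2001):
    """Grid approximation of the pairwise problem for unconstrained scores in R^2."""
    best = np.inf
    for q0 in np.linspace(0.0, 1.0, grid):
        q = np.array([q0, 1.0 - q0])
        ell = L @ q                        # expected loss of predicting each label
        if ell[i] > ell[j] - eps + 1e-12:
            continue                       # label i must beat label j by at least eps
        f_star = -ell                      # unconstrained minimizer of the surrogate risk
        # Squared distance from f_star to the half-space {f : f_j >= f_i}
        # equals gap^2 / 2 when the constraint is violated, and 0 otherwise.
        gap = max(f_star[i] - f_star[j], 0.0)
        best = min(best, (gap ** 2 / 2.0) / (2 * k))
    return best

eps = 0.2
value = h_ij(eps)
print(value, eps ** 2 / 8)
```

By the symmetry of the two labels, both pairwise problems have the same value, so under these assumptions the minimum over pairs is approximately $\varepsilon^2 / 8$.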
Appendix C Consistent surrogate losses
An ideal surrogate should not only be consistent, but also allow efficient optimization, e.g., by being convex and allowing fast computation of stochastic gradients. In this paper, we study a generalization to arbitrary multi-class losses of a surrogate loss class from Zhang (2004a, Section 4.4.2) that satisfies these requirements (Zhang (2004a) refers to this surrogate as the “decoupled unconstrained background discriminative surrogate”; note the $\frac{1}{k}$ scaling, which makes $\Phi_{a,b}$ of order 1):
$$\Phi_{a,b}(f, y) := \frac{1}{k} \sum_{c=1}^{k} \big(L(c, y)\, a(f_c) + b(f_c)\big),$$
where $a, b : \mathbb{R} \to \mathbb{R}$ are convex functions. A generic method to minimize this surrogate is to use any version of the SGD algorithm, while computing the stochastic gradient by sampling $(x, y)$ from the data generating distribution and a label $c$ uniformly. In the case of the quadratic surrogate $\Phi_{\mathrm{quad}}$, we proposed in the main paper to compute the sum over $c$ analytically instead of sampling $c$.
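The label-sampling scheme can be sketched as follows, assuming the surrogate has the form $\frac{1}{k}\sum_{c}\big(L(c, y)\, a(f_c) + b(f_c)\big)$. The sketch uses the hypothetical choice $a(f) = f$, $b(f) = f^2/2$ (which matches a quadratic surrogate up to a term independent of $f$) and checks that the one-coordinate gradient estimate, averaged over a uniformly sampled label, equals the full gradient w.r.t. the scores. All function names are illustrative.

```python
import numpy as np

def a_prime(f):
    """Derivative of the hypothetical choice a(f) = f."""
    return np.ones_like(f)

def b_prime(f):
    """Derivative of the hypothetical choice b(f) = f**2 / 2."""
    return f

def full_gradient(f, y, L):
    """Gradient of (1/k) * sum_c (L[c, y] * a(f_c) + b(f_c)) w.r.t. the scores f."""
    k = f.shape[0]
    return (L[:, y] * a_prime(f) + b_prime(f)) / k

def sampled_gradient(f, y, c, L):
    """One-coordinate estimate obtained from a single sampled label c."""
    g = np.zeros_like(f)
    g[c] = L[c, y] * a_prime(f[c]) + b_prime(f[c])
    return g

rng = np.random.default_rng(0)
k = 5
L = np.ones((k, k)) - np.eye(k)   # 0-1 loss, used only as an example
f = rng.standard_normal(k)
y = 2

# In SGD one would draw a single label, e.g. c = rng.integers(k).
# Averaging over all k equally likely labels gives the exact expectation.
mean_sampled = np.mean([sampled_gradient(f, y, c, L) for c in range(k)], axis=0)
print(np.allclose(mean_sampled, full_gradient(f, y, L)))
```

Each sampled gradient touches a single coordinate of the scores, which is what makes this generic scheme cheap per step when $k$ is large, at the price of extra variance.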
Theorem 11 (Sufficient conditions for consistency).
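The surrogate loss $\Phi_{a,b}$ is consistent w.r.t. a task loss $L$, i.e., $H_{\Phi_{a,b}, L, \mathbb{R}^k}(\varepsilon) > 0$ for any $\varepsilon > 0$, under the following conditions on the functions $a(f)$ and $b(f)$:
1. The functions $a$ and $b$ are convex and differentiable.
2. The function $c\, a(f) + b(f)$ is bounded from below and has a unique global minimizer (finite or infinite) for all $c \in [L_{\min}, L_{\max}]$.
3. The functions $a(f)$ and $\frac{b'(f)}{a'(f)}$ are strictly increasing.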
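Proof. Consider an arbitrary conditional probability vector $q \in \Delta_k$. Assumption 2 then implies that the global minimizer $f^*$ of the conditional surrogate risk $\phi_{a,b}(f, q)$ w.r.t. $f$ is unique. Assumption 1 allows us to set the derivatives to zero and obtain $\frac{b'(f^*_c)}{a'(f^*_c)} = -\ell_c(q)$, where $\ell_c(q) := (Lq)_c$. Assumption 3 then implies that $f^*_j \ge f^*_i$ holds if and only if $\ell_j(q) \le \ell_i(q)$.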
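The stationarity argument can be illustrated numerically. The sketch below assumes the conditional surrogate risk $\frac{1}{k}\sum_c \big(\ell_c(q)\, a(f_c) + b(f_c)\big)$ with $\ell_c(q) = (Lq)_c$, and uses two hypothetical pairs with closed-form stationary points: $a(f) = f$, $b(f) = f^2/2$, and $a(f) = e^{f}$, $b(f) = e^{-f}$. For each pair it checks that the gradient vanishes at the candidate minimizer and that sorting labels by decreasing score agrees with sorting them by increasing expected loss.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 5
L = np.ones((k, k)) - np.eye(k)      # 0-1 loss, used only as an example
q = rng.dirichlet(np.ones(k))        # conditional distribution over labels
ell = L @ q                          # expected losses, ell_c(q) = (L q)_c

# Each entry: (a', b', stationary point solving b'(f)/a'(f) = -ell).
#   quadratic:   b'/a' = f            ->  f* = -ell
#   exponential: b'/a' = -exp(-2 f)   ->  f* = -0.5 * log(ell)
surrogates = {
    "quadratic": (lambda f: np.ones_like(f), lambda f: f, -ell),
    "exponential": (np.exp, lambda f: -np.exp(-f), -0.5 * np.log(ell)),
}

checks = {}
for name, (a_prime, b_prime, f_star) in surrogates.items():
    grad = (ell * a_prime(f_star) + b_prime(f_star)) / k
    stationary = bool(np.allclose(grad, 0.0))
    same_order = bool(np.array_equal(np.argsort(-f_star), np.argsort(ell)))
    checks[name] = (stationary, same_order)
    print(name, stationary, same_order)
```

Both pairs rank the labels identically even though the minimizing scores themselves differ, which is the property that the rest of the argument relies on.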
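Now, we will prove by contradiction that $H_{\Phi_{a,b}, L, \mathbb{R}^k}(\varepsilon) > 0$ for any $\varepsilon > 0$. Assume that for some $\varepsilon > 0$ we have $H_{\Phi_{a,b}, L, \mathbb{R}^k}(\varepsilon) = 0$. Lemma 10 then implies that for some $i, j \in \hat{\mathcal{Y}}$, $i \neq j$, we have $H_{ij}(\varepsilon) = 0$. Note that the domain of (21) defining $H_{ij}$ is separable w.r.t. $q$ and $f$. We can now rewrite (21) as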