A Smoother Way to Train Structured Prediction Models
We present a framework to train a structured prediction model by performing smoothing on the inference algorithm it builds upon. Smoothing overcomes the non-smoothness inherent to the maximum margin structured prediction objective, and paves the way for the use of fast primal gradient-based optimization algorithms. We illustrate the proposed framework by developing a novel primal incremental optimization algorithm for the structural support vector machine. The proposed algorithm blends an extrapolation scheme for acceleration and an adaptive smoothing scheme and builds upon the stochastic variance-reduced gradient algorithm. We establish its worst-case global complexity bound and study several practical variants, including extensions to deep structured prediction. We present experimental results on two real-world problems, namely named entity recognition and visual object localization. These results show that the proposed framework allows us to build upon efficient inference algorithms to develop large-scale optimization algorithms for structured prediction which can achieve competitive performance on the two real-world problems.
Consider the optimization problem arising when training maximum margin structured prediction models:

    min over w ∈ R^d of F(w) := (1/n) Σ_{i=1}^n f_i(w) + (λ/2) ‖w‖₂²,   (1)

where each f_i is the structural hinge loss. Max-margin structured prediction was designed to predict discrete data structures such as sequences and trees (Taskar et al., 2004; Tsochantaridis et al., 2004).
Batch non-smooth optimization algorithms such as cutting plane methods are appropriate for problems with small or moderate sample sizes (Tsochantaridis et al., 2004; Joachims et al., 2009). Stochastic non-smooth optimization algorithms such as stochastic subgradient methods can tackle problems with large sample sizes (Ratliff et al., 2007; Shalev-Shwartz et al., 2011). However, both families of methods achieve the typical worst-case complexity bounds of non-smooth optimization algorithms and cannot easily leverage a possible hidden smoothness of the objective.
Furthermore, as significant progress is being made on incremental smooth optimization algorithms for training unstructured prediction models (Lin et al., 2018), we would like to transfer such advances and design faster optimization algorithms to train structured prediction models. Indeed, if each term in the finite-sum were L-smooth, incremental optimization algorithms such as MISO (Mairal, 2015), SAG (Le Roux et al., 2012; Schmidt et al., 2017), SAGA (Defazio et al., 2014), SDCA (Shalev-Shwartz and Zhang, 2013), and SVRG (Johnson and Zhang, 2013) could leverage the finite-sum structure of the objective (1) and achieve faster convergence than batch algorithms on large-scale problems.
Incremental optimization algorithms can be further accelerated, either on a case-by-case basis (Shalev-Shwartz and Zhang, 2014; Frostig et al., 2015; Allen-Zhu, 2017; Defazio, 2016) or using the Catalyst acceleration scheme (Lin et al., 2015, 2018), to achieve near-optimal convergence rates (Woodworth and Srebro, 2016). Accelerated incremental optimization algorithms demonstrate stable and fast convergence behavior on a wide range of problems, in particular for ill-conditioned ones.
We introduce a general framework that allows us to bring the power of accelerated incremental optimization algorithms to the realm of structured prediction problems. To illustrate our framework, we focus on the problem of training a structural support vector machine (SSVM), and extend the developed algorithms to deep structured prediction models with nonlinear mappings.
We seek primal optimization algorithms, as opposed to saddle-point or primal-dual optimization algorithms, in order to be able to tackle structured prediction models with affine mappings such as SSVM as well as deep structured prediction models with nonlinear mappings. We show how to smooth away the inherent non-smoothness of the objective while still relying on efficient inference algorithms.
- Smooth Inference Oracles.
We introduce a notion of smooth inference oracles that gracefully fits the framework of black-box first-order optimization. While the exp inference oracle reveals the relationship between max-margin and probabilistic structured prediction models, the top-K inference oracle can be efficiently computed using simple modifications of efficient inference algorithms in many cases of interest.
- Incremental Optimization Algorithms.
We present a new algorithm built on top of SVRG, blending an extrapolation scheme for acceleration and an adaptive smoothing scheme. We establish the worst-case complexity bounds of the proposed algorithm and extend it to the case of non-linear mappings. Finally, we demonstrate its effectiveness compared to competing algorithms on two tasks, namely named entity recognition and visual object localization.
The code is publicly available as a software library called Casimir (https://github.com/krishnap25/casimir). The outline of the paper is as follows: Sec. 1.1 reviews related work. Sec. 2 discusses smoothing for structured prediction, followed by Sec. 3, which defines and studies the properties of inference oracles, and Sec. 4, which describes the concrete implementation of these inference oracles in several settings of interest. Then, we switch gears to study accelerated incremental algorithms in the convex case (Sec. 5) and their extensions to deep structured prediction (Sec. 6). Finally, we evaluate the proposed algorithms on two tasks, namely named entity recognition and visual object localization, in Sec. 7.
1.1 Related Work
Optimization for Structural Support Vector Machines
Table 1 gives an overview of different optimization algorithms designed for structural support vector machines. Early works (Taskar et al., 2004; Tsochantaridis et al., 2004; Joachims et al., 2009; Teo et al., 2009) considered batch dual quadratic optimization (QP) algorithms. The stochastic subgradient method operated directly on the non-smooth primal formulation (Ratliff et al., 2007; Shalev-Shwartz et al., 2011). More recently, Lacoste-Julien et al. (2013) proposed a block coordinate Frank-Wolfe (BCFW) algorithm to optimize the dual formulation of structural support vector machines; see also Osokin et al. (2016) for variants and extensions. Saddle-point or primal-dual approaches include the mirror-prox algorithm (Taskar et al., 2006; Cox et al., 2014; He and Harchaoui, 2015). Palaniappan and Bach (2016) propose an incremental optimization algorithm for saddle-point problems. However, it is unclear how to extend it to the structured prediction problems considered here. Incremental optimization algorithms for conditional random fields were proposed by Schmidt et al. (2015). We focus here on primal optimization algorithms in order to be able to train structured prediction models with affine or nonlinear mappings with a unified approach, and on incremental optimization algorithms which can scale to large datasets.
The ideas of dynamic programming inference in tree structured graphical models have been around since the pioneering works of Pearl (1988) and Dawid (1992). Other techniques emerged based on graph cuts (Greig et al., 1989; Ishikawa and Geiger, 1998), bipartite matchings (Cheng et al., 1996; Taskar et al., 2005) and search algorithms (Daumé III and Marcu, 2005; Lampert et al., 2008; Lewis and Steedman, 2014; He et al., 2017). For graphical models that do not admit such discrete structure, techniques based on loopy belief propagation (McEliece et al., 1998; Murphy et al., 1999), linear programming (LP) (Schlesinger, 1976), dual decomposition (Johnson, 2008) and variational inference (Wainwright et al., 2005; Wainwright and Jordan, 2008) gained popularity.
Smooth inference oracles echo older smoothing heuristics in speech and language processing (Jurafsky et al., 2014). Combinatorial algorithms for top-K inference have been studied extensively by the graphical models community under the name “K-best MAP”. Seroussi and Golmard (1994) and Nilsson (1998) first considered the problem of finding the K most probable configurations in a tree structured graphical model. Later, Yanover and Weiss (2004) presented the Best Max-Marginal First algorithm, which solves this problem with access only to an oracle that computes max-marginals. We also use this algorithm in Sec. 4.2. Fromer and Globerson (2009) study top-K inference for LP relaxations, while Batra (2012) considers the dual problem to exploit graph structure. Flerova et al. (2016) study top-K extensions of the popular A* and branch-and-bound search algorithms in the context of graphical models. Other related approaches include diverse M-best solutions (Batra et al., 2012) and finding M most probable modes (Chen et al., 2013).
Smoothing for inference was used to speed up iterative algorithms for continuous relaxations. Johnson (2008) considered smoothing dual decomposition inference using the entropy smoother, followed by Jojic et al. (2010) and Savchynskyy et al. (2011) who studied its theoretical properties. Meshi et al. (2012) expand on this study to include ℓ2² smoothing. Explicitly smoothing discrete inference algorithms in order to smooth the learning problem was considered by Zhang et al. (2014) and Song et al. (2014) using the entropy and ℓ2² smoothers respectively. The ℓ2² smoother was also used by Martins and Astudillo (2016). Hazan et al. (2016) consider the approach of blending learning and inference, instead of using inference algorithms as black-box procedures.
Related ideas to ours appear in the independent works of Mensch and Blondel (2018) and Niculae et al. (2018). These works partially overlap with ours, but choose different perspectives, making them complementary to each other. Mensch and Blondel (2018) proceed differently when, e.g., smoothing inference based on dynamic programming. Moreover, they do not establish complexity bounds for optimization algorithms making calls to the resulting smooth inference oracles. We define smooth inference oracles in the context of black-box first-order optimization and establish worst-case complexity bounds for incremental optimization algorithms making calls to these oracles. Indeed, we relate the amount of smoothing, controlled by the parameter μ, to the resulting complexity of the optimization algorithms relying on smooth inference oracles.
End-to-end Training of Structured Prediction
The general framework for global training of structured prediction models was introduced by Bottou and Gallinari (1990) and applied to handwriting recognition by Bengio et al. (1995) and to document processing by Bottou et al. (1997). This approach, now called “deep structured prediction”, was used, e.g., by Collobert et al. (2011) and Belanger and McCallum (2016).
Vectors are denoted by bold lowercase characters such as w, while matrices are denoted by bold uppercase characters such as A. For a matrix A and p, q ≥ 1, we define the norm ‖A‖_{p,q} induced by the vector norms ‖·‖_p and ‖·‖_q, i.e., ‖A‖_{p,q} = max{‖Av‖_q : ‖v‖_p ≤ 1}.
For any function h : R^d → R ∪ {+∞}, its convex conjugate h* is defined as

    h*(u) := sup over z ∈ R^d of { ⟨u, z⟩ − h(z) }.
A function h is said to be L-smooth with respect to an arbitrary norm ‖·‖ if it is continuously differentiable and its gradient ∇h is L-Lipschitz with respect to ‖·‖. When left unspecified, ‖·‖ refers to the Euclidean norm ‖·‖₂. Given a continuously differentiable map g : R^d → R^m, its Jacobian ∇g(w) at w is defined so that its (i, j)th entry is ∂g_i(w)/∂w_j, where g_i is the ith element of g and w_j is the jth element of w. The vector-valued function g is said to be L-smooth with respect to ‖·‖ if it is continuously differentiable and its Jacobian is L-Lipschitz with respect to ‖·‖.
For a vector z ∈ R^d, let z_[1] ≥ z_[2] ≥ ⋯ ≥ z_[d] refer to its components enumerated in non-increasing order, where ties are broken arbitrarily. Further, we let top_K(z) denote the vector of the K largest components of z. We denote by Δ^{d−1} the standard probability simplex in R^d. When the dimension is clear from the context, we shall simply denote it by Δ. Moreover, for a positive integer p, [p] refers to the set {1, …, p}. Lastly, Õ in the big-O notation hides factors logarithmic in problem parameters.
2 Smooth Structured Prediction
Structured prediction aims to search for score functions φ(x, y; w), parameterized by w ∈ R^d, that model the compatibility of input x and output y through a graphical model. Given a score function φ, predictions are made using an inference procedure which, when given an input x, produces the best output

    y*(x; w) ∈ argmax over y ∈ Y of φ(x, y; w).   (3)
We shall return to the score functions and the inference procedures in Sec. 3. First, given such a score function φ, we define the structural hinge loss and describe how it can be smoothed.
2.1 Structural Hinge Loss
On a given input-output pair (x_i, y_i), the error of prediction by the inference procedure with a score function φ is measured by a task loss ℓ(y, y_i), such as the Hamming loss. The learning procedure would then aim to find the best parameter w that minimizes the task loss on a given dataset {(x_i, y_i)}_{i=1}^n of input-output training examples. However, the resulting problem is piecewise constant and hard to optimize. Instead, Altun et al. (2003); Taskar et al. (2004); Tsochantaridis et al. (2004) propose to minimize a majorizing surrogate of the task loss, called the structural hinge loss, defined on an input-output pair (x_i, y_i) as

    f_i(w) := max over y ∈ Y of { φ(x_i, y; w) + ℓ(y, y_i) } − φ(x_i, y_i; w) = max over y ∈ Y of ψ_i(y; w),   (4)

where ψ_i(y; w) := φ(x_i, y; w) + ℓ(y, y_i) − φ(x_i, y_i; w) is the augmented score function.
This approach, known as max-margin structured prediction, builds upon binary and multi-class support vector machines (Crammer and Singer, 2001), where the term inside the maximization in (4) generalizes the notion of margin. The task loss ℓ is assumed to possess appropriate structure so that the maximization inside (4), known as loss augmented inference, is no harder than the inference problem in (3). When considering a fixed input-output pair (x_i, y_i), we drop the index with respect to the sample and consider the structural hinge loss as

    f(w) = max over y ∈ Y of ψ(y; w).
When the map w ↦ ψ(y; w) is affine for every y, the structural hinge loss and the objective from (1) are both convex; we refer to this case as the structural support vector machine. When ψ(y; ·) is a nonlinear but smooth map, then the structural hinge loss and the objective are nonconvex.
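On a toy problem where the output set can be enumerated, the structural hinge loss can be computed by brute force. A minimal Python sketch (function names and the explicit enumeration are illustrative assumptions; realistic problems use the inference algorithms of Sec. 3 instead):

```python
def structural_hinge(scores, y_true, task_loss):
    """Structural hinge loss by brute-force loss-augmented inference.

    scores: dict mapping each candidate output y to its score phi(x, y; w).
    y_true: the ground-truth output.
    task_loss: function giving the task loss l(y, y_true).
    """
    # max over outputs of (score + task loss), minus the true output's score
    augmented = {y: s + task_loss(y) for y, s in scores.items()}
    y_star = max(augmented, key=augmented.get)
    return augmented[y_star] - scores[y_true], y_star

# Tiny example: outputs are label sequences of length 2 over {0, 1},
# with the Hamming loss as the task loss.
y_true = (0, 1)
scores = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 1.5}
hamming = lambda y: sum(a != b for a, b in zip(y, y_true))
value, y_star = structural_hinge(scores, y_true, hamming)
```

Since the augmented maximization includes y_true itself (with zero task loss), the loss is always nonnegative, as a majorizing surrogate should be.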
2.2 Smoothing Strategy
A convex, non-smooth function can be smoothed by taking its infimal convolution with a smooth function (Beck and Teboulle, 2012). We now recall its dual representation, which Nesterov (2005b) first used to relate the amount of smoothing to optimal complexity bounds.
For a given convex function h : R^d → R, a smoothing function ω which is 1-strongly convex with respect to ‖·‖_α (for α ∈ {1, 2}), and a parameter μ > 0, define

    h_{μω}(z) := max over u ∈ dom h* of { ⟨u, z⟩ − h*(u) − μ ω(u) }

as the smoothing of h by μω.
We now state a classical result showing how the parameter μ controls both the approximation error and the level of the smoothing. For a proof, see Beck and Teboulle (2012, Thm. 4.1, Lemma 4.2) or Prop. 39 of Appendix A.
Consider the setting of Def. 1. The smoothing h_{μω} is continuously differentiable and its gradient, given by

    ∇h_{μω}(z) = argmax over u ∈ dom h* of { ⟨u, z⟩ − h*(u) − μ ω(u) },

is (1/μ)-Lipschitz with respect to the dual norm of ‖·‖_α. Moreover, letting Ω := sup of ω − inf of ω over dom h*, the smoothing satisfies, for all z,

    h_{μω}(z) ≤ h(z) ≤ h_{μω}(z) + μΩ.
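For intuition, these approximation properties can be checked numerically for the entropy smoother (Sec. 2.3.1), whose smoothing of the max function is the log-sum-exp. A small sketch with illustrative names:

```python
import numpy as np

def smooth_max_entropy(z, mu):
    """Entropy-smoothed max: mu * log(sum_i exp(z_i / mu)).

    Its gradient is the softmax of z / mu, a point in the simplex."""
    z = np.asarray(z, dtype=float)
    m = z.max()                                    # stabilize the exponentials
    val = m + mu * np.log(np.exp((z - m) / mu).sum())
    grad = np.exp((z - val) / mu)                  # softmax, computed stably
    return val, grad

# The smoothed max over-approximates max(z) by at most mu * log(d).
val, grad = smooth_max_entropy([1.0, 2.0, 3.0], mu=0.1)
```

As μ decreases, the value approaches max(z) and the gradient concentrates on the maximizing coordinate, recovering the non-smooth behavior.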
Smoothing the Structural Hinge Loss
We rewrite the structural hinge loss as a composition

    f(w) = (max ∘ g)(w),   (7)

where g(w) := (ψ(y; w))_{y ∈ Y} ∈ R^{|Y|}, so that the structural hinge loss reads as the maximum over the components of g(w).
We smooth the structural hinge loss (7) by simply smoothing the non-smooth max function as

    f_{μω}(w) := (max_{μω} ∘ g)(w).

When g is smooth and Lipschitz continuous, f_{μω} is a smooth approximation of the structural hinge loss, whose gradient is readily given by the chain rule. In particular, when g is an affine map w ↦ Aw + b, it follows that f_{μω} is (‖A‖²/μ)-smooth with respect to ‖·‖₂ (cf. Lemma 40 in Appendix A). Furthermore, for α ∈ {1, 2}, we have, for all w,

    f_{μω}(w) ≤ f(w) ≤ f_{μω}(w) + μΩ.
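In the affine case, the chain rule makes the gradient of the smoothed loss cheap to form once the smoothed max gradient is available. A minimal sketch with the entropy smoother (the matrix A, offset b, and function names are illustrative assumptions):

```python
import numpy as np

def smoothed_hinge_affine(A, b, w, mu):
    """Value and gradient of w -> max_{mu*omega}(A w + b) with the entropy
    smoother. By the chain rule, the gradient is A^T u, where u is the
    gradient of the smoothed max (a softmax) at z = A w + b."""
    z = A @ w + b
    m = z.max()
    val = m + mu * np.log(np.exp((z - m) / mu).sum())
    u = np.exp((z - val) / mu)          # gradient of the smoothed max at z
    return val, A.T @ u                 # chain rule for the composition
```

A finite-difference check confirms that the returned vector is indeed the gradient of the returned value.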
2.3 Smoothing Variants
In the context of smoothing the max function, we now describe two popular choices for the smoothing function ω, followed by computational considerations.
2.3.1 Entropy and ℓ2² smoothing
When h is the max function, the smoothing operation can be computed analytically for the entropy smoother and the ℓ2² smoother, denoted respectively as

    ω_{−H}(u) := Σ_{i=1}^d u_i log u_i   and   ω_{ℓ2²}(u) := (1/2) ‖u‖₂².

These lead respectively to the log-sum-exp function (Nesterov, 2005b, Lemma 4)

    max_{μω_{−H}}(z) = μ log ( Σ_{i=1}^d exp(z_i / μ) ),

and an orthogonal projection onto the simplex,

    ∇ max_{μω_{ℓ2²}}(z) = proj_Δ(z / μ).

Furthermore, the following holds for all z ∈ R^d from Prop. 2:

    max(z) ≤ max_{μω_{−H}}(z) ≤ max(z) + μ log d   and   max(z) − μ/2 ≤ max_{μω_{ℓ2²}}(z) ≤ max(z).
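The ℓ2² case can likewise be sketched in code: the gradient of the smoothed max is a Euclidean projection onto the simplex, computed here with the classical sort-based routine (function names are illustrative):

```python
import numpy as np

def project_simplex(z):
    """Euclidean projection of z onto the probability simplex (sort-based)."""
    u = np.sort(z)[::-1]                     # entries in non-increasing order
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, len(z) + 1)
    k = ks[u - css / ks > 0].max()           # sparsity of the projection
    tau = css[k - 1] / k
    return np.maximum(z - tau, 0.0)

def smooth_max_l2(z, mu):
    """l2-smoothed max: max over u in the simplex of <z, u> - (mu/2)||u||^2.

    The maximizer, and hence the gradient, is proj_simplex(z / mu),
    which is typically sparse."""
    z = np.asarray(z, dtype=float)
    u = project_simplex(z / mu)
    return float(z @ u - 0.5 * mu * (u @ u)), u
```

The sparsity of the gradient is precisely what the top-K strategy of the next subsection exploits.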
2.3.2 Top-K Strategy
Though the gradient of the composition max_{μω} ∘ g can be written using the chain rule, its actual computation for structured prediction problems involves computing g over all of its |Y| components, which may be intractable. However, in the case of ℓ2² smoothing, projections onto the simplex are sparse, as pointed out by the following proposition.
Consider the Euclidean projection u* = proj_Δ(z) of z ∈ R^d onto the simplex. The projection u* has exactly k non-zeros if and only if

    Σ_{j=1}^k (z_[j] − z_[k]) < 1 ≤ Σ_{j=1}^k (z_[j] − z_[k+1]),

where z_[1] ≥ ⋯ ≥ z_[d] are the components of z in non-increasing order and z_[d+1] := −∞. In this case, u* is given by

    u*_i = [z_i − τ_k]_+  with  τ_k = ( Σ_{j=1}^k z_[j] − 1 ) / k.
The projection satisfies u*_i = [z_i − τ]_+ for each i, where τ is the unique solution of the equation

    Σ_{i=1}^d [z_i − τ]_+ = 1,   (9)

where [t]_+ := max(t, 0). See, e.g., Held et al. (1974) for a proof of this fact. Note that z_[k+1] ≤ τ implies that u*_[j] = 0 for all j ≥ k + 1. Therefore, u* has exactly k non-zeros if and only if z_[k] > τ and z_[k+1] ≤ τ.
Now suppose that u* has exactly k non-zeros; we can then solve (9) to obtain τ = τ_k, which is defined as

    τ_k = ( Σ_{j=1}^k z_[j] − 1 ) / k.

Plugging in the value of τ_k in z_[k] > τ_k gives Σ_{j=1}^k (z_[j] − z_[k]) < 1. Likewise, z_[k+1] ≤ τ_k gives 1 ≤ Σ_{j=1}^k (z_[j] − z_[k+1]).
Thus, the projection of z onto the simplex picks out some number of the largest entries of z; we refer to this number as the sparsity of proj_Δ(z). This fact motivates the top-K strategy: given z ∈ R^d, fix an integer K a priori and consider as surrogates for max_{μω}(z) and ∇max_{μω}(z) respectively

    max_{μω}(top_K(z))   and   A_K^⊤ ∇max_{μω}(top_K(z)),

where top_K(z) ∈ R^K denotes the vector composed of the K largest entries of z and A_K defines their extraction, i.e., top_K(z) = A_K z, where the rows of A_K are unit vectors e_{j_1}^⊤, …, e_{j_K}^⊤ with indices satisfying z_{j_1} ≥ ⋯ ≥ z_{j_K}. A surrogate of the smoothing max_{μω} ∘ g is then given by

    f_{μω}^K(w) := max_{μω}(top_K(g(w))).   (11)
Exactness of Top- Strategy
We say that the top-K strategy is exact at w for μ when it recovers the first order information of f_{μω}, i.e., when f_{μω}^K(w) = f_{μω}(w) and ∇f_{μω}^K(w) = ∇f_{μω}(w). The next proposition outlines when this is the case. Note that if the top-K strategy is exact at w for a smoothing parameter μ, then it will be exact at w for any 0 < μ′ ≤ μ.
The top-K strategy is exact at w for ℓ2² smoothing with parameter μ if, with z := g(w),

    Σ_{j=1}^K (z_[j] − z_[K+1]) ≥ μ.   (12)

Moreover, for any fixed z such that the vector (z_[1], …, z_[K+1]) has at least two distinct entries, the left-hand side of (12) is positive, so the top-K strategy is exact at w for all sufficiently small μ > 0.
First, we note that the top-K strategy is exact at w when the sparsity k of the projection proj_Δ(z/μ) satisfies k ≤ K. From Prop. 3, the condition that k ≤ K happens when

    1 ≤ Σ_{j=1}^K (z_[j]/μ − z_[K+1]/μ),

since the intervals in the union are contiguous. This establishes (12).
If the top-K strategy is exact at w for μ, then ∇f_{μω}^K(w) = ∇f_{μω}(w), where the latter follows from the chain rule. When used instead of ℓ2² smoothing in the algorithms presented in Sec. 5, the top-K strategy provides a computationally efficient heuristic to smooth the structural hinge loss. Though we do not have theoretical guarantees using this surrogate, experiments presented in Sec. 7 show its efficiency and its robustness to the choice of K.
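A minimal sketch of the top-K strategy for the ℓ2² smoother (function names are illustrative; the simplex projection is the classical sort-based routine): smooth only the K largest entries, embed the resulting sparse gradient back into the full dimension, and observe exactness on an example where the full projection has at most K non-zeros.

```python
import numpy as np

def project_simplex(z):
    """Euclidean projection onto the probability simplex (sort-based)."""
    u = np.sort(z)[::-1]
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, len(z) + 1)
    k = ks[u - css / ks > 0].max()
    tau = css[k - 1] / k
    return np.maximum(z - tau, 0.0)

def smooth_max_l2(z, mu):
    """l2-smoothed max and its (sparse) gradient."""
    z = np.asarray(z, dtype=float)
    u = project_simplex(z / mu)
    return float(z @ u - 0.5 * mu * (u @ u)), u

def top_k_smooth_max_l2(z, mu, K):
    """Top-K surrogate: apply the l2 smoothing to the K largest entries only,
    then place the sparse gradient back at the corresponding coordinates."""
    z = np.asarray(z, dtype=float)
    idx = np.argsort(z)[::-1][:K]            # indices of the K largest entries
    val, u_small = smooth_max_l2(z[idx], mu)
    u = np.zeros_like(z)
    u[idx] = u_small
    return val, u
```

When the exactness condition fails (μ too large relative to the gaps in z), the surrogate deviates from the true smoothing, which is the regime the proposition above rules out.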
3 Inference Oracles
This section studies first order oracles used in standard and smoothed structured prediction. We first describe the parameterization of the score functions through graphical models.
3.1 Score Functions
Structured prediction is defined by the structure of the output y, while the input x can be arbitrary. Each output y ∈ Y is composed of components that are linked through a graphical model G = (V, E): the nodes V represent the components of the output, while the edges E define the dependencies between the various components. The value of each component y_v for v ∈ V represents the state of the node v and takes values from a finite set Y_v. The set Y of all output structures is then finite, yet potentially intractably large.
The structure of the graph (i.e., its edge structure) depends on the task. For the task of sequence labeling, the graph is a chain, while for the task of parsing, the graph is a tree. On the other hand, the graph used in image segmentation is a grid.
For a given input x and a score function φ, the value φ(x, y; w) measures the compatibility of the output y for the input x. The essential characteristic of the score function is that it decomposes over the nodes and edges of the graph as

    φ(x, y; w) = Σ_{v ∈ V} φ_v(x, y_v; w) + Σ_{(u,v) ∈ E} φ_{u,v}(x, y_u, y_v; w).   (13)
For a fixed w, each input x defines a specific compatibility function φ(x, ·; w) over Y. The nature of the problem and the optimization algorithms we consider hinge upon whether φ is an affine function of w or not. The two settings studied here are the following:
Pre-defined Feature Map. In this structured prediction framework, a pre-specified feature map Φ(x, y) is employed and the score is then defined as the linear function

    φ(x, y; w) = ⟨w, Φ(x, y)⟩.   (14)
Learning the Feature Map. We also consider the setting where the feature map Φ(x, y; u) is parameterized by u, for example using a neural network, and is learned from the data. The score function can then be written as

    φ(x, y; w) = ⟨v, Φ(x, y; u)⟩,   (15)

where w = (u, v) and the scalar product decomposes into nodes and edges as above.
Note that we only need the decomposition of the score function over the nodes and edges of the graph G as in Eq. (13). In particular, while Eq. (15) is helpful to understand the use of neural networks in structured prediction, the optimization algorithms developed in Sec. 6 apply to general nonlinear but smooth score functions.
This framework captures both generative probabilistic models such as Hidden Markov Models (HMMs), which model the joint distribution of x and y, as well as discriminative probabilistic models such as conditional random fields (Lafferty et al., 2001), where dependencies among the input variables do not need to be explicitly represented. In these cases, the log joint and log conditional probabilities respectively play the role of the score φ.
Example 5 (Sequence Tagging).
Consider the task of sequence tagging in natural language processing, where each x is a sequence of words and y is a sequence of labels, both of length T. Common examples include part of speech tagging and named entity recognition. Each word in the sequence x comes from a finite dictionary D, and each tag in y takes values from a finite set of labels L. The corresponding graph is simply a linear chain.
The score function measures the compatibility of a label sequence y for the input x using parameters w = (w_node, w_edge) as, for instance,

    φ(x, y; w) = Σ_{t=1}^T φ_t(x, y_t; w) + Σ_{t=0}^T φ_{t,t+1}(y_t, y_{t+1}; w),

where, using w_node and w_edge as node and edge weights respectively, we define for each t ∈ [T],

    φ_t(x, y_t; w) = ⟨w_node, e_{x_t} ⊗ e_{y_t}⟩.

The pairwise term φ_{t,t+1}(y_t, y_{t+1}; w) = ⟨w_edge, e_{y_t} ⊗ e_{y_{t+1}}⟩ is analogously defined. Here, y_0 and y_{T+1} are special “start” and “stop” symbols respectively. This can be written as a dot product of w with a pre-specified feature map Φ(x, y) as in (14), by defining

    Φ(x, y) = ( Σ_{t=1}^T e_{x_t} ⊗ e_{y_t} ) ⊕ ( Σ_{t=0}^T e_{y_t} ⊗ e_{y_{t+1}} ),

where e_a ∈ R^{|D|} is the unit vector indexed by the word a, e_s ∈ R^{|L|} is the unit vector indexed by the tag s, ⊗ denotes the Kronecker product between vectors and ⊕ denotes vector concatenation.
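A minimal sketch of this feature map (indices and sizes are illustrative assumptions, and the start/stop symbols are omitted for brevity): the accumulated Kronecker products of one-hot vectors amount to co-occurrence counts, concatenated into a single joint feature vector.

```python
import numpy as np

def chain_feature_map(x_ids, y_ids, n_words, n_tags):
    """Joint feature map for chain sequence tagging.

    Concatenates (word, tag) co-occurrence counts (the unary part) with
    (tag, tag) transition counts (the pairwise part); each count is an
    accumulated Kronecker product of one-hot vectors."""
    unary = np.zeros((n_words, n_tags))
    for w, t in zip(x_ids, y_ids):
        unary[w, t] += 1.0
    pairwise = np.zeros((n_tags, n_tags))
    for t_prev, t_next in zip(y_ids[:-1], y_ids[1:]):
        pairwise[t_prev, t_next] += 1.0
    return np.concatenate([unary.ravel(), pairwise.ravel()])
```

The score ⟨w, Φ(x, y)⟩ is then linear in w and decomposes over nodes and edges as in (13).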
3.2 Inference Oracles
We now define inference oracles as first order oracles in structured prediction. These are used later to understand the information-based complexity of optimization algorithms.
3.2.1 First Order Oracles in Structured Prediction
A first order oracle for a function f is a routine which, given a point w, returns the value f(w) and a (sub)gradient v ∈ ∂f(w), where ∂f is the Fréchet (or regular) subdifferential (Rockafellar and Wets, 2009, Def. 8.3). We now define inference oracles as first order oracles for the structural hinge loss f and its smoothed variants f_{μω}. Note that these definitions are independent of the graphical structure. However, as we shall see, the graphical structure plays a crucial role in the implementation of the inference oracles.
Consider an augmented score function ψ, a level of smoothing μ > 0 and the structural hinge loss f(w) = max over y ∈ Y of ψ(y; w). For a given w,
the max oracle returns f(w) and a subgradient v ∈ ∂f(w).
the exp oracle returns f_{μω_{−H}}(w) and ∇f_{μω_{−H}}(w).
the top-K oracle returns f_{μω_{ℓ2²}}^K(w) and ∇f_{μω_{ℓ2²}}^K(w) as surrogates for f_{μω_{ℓ2²}}(w) and ∇f_{μω_{ℓ2²}}(w) respectively.
Note that the exp oracle gets its name since its gradient can be written as an expectation over all y ∈ Y, as revealed by the next lemma, which gives analytical expressions for the gradients returned by the oracles.
Consider the setting of Def. 6. We have the following:
For any y* ∈ argmax over y ∈ Y of ψ(y; w), we have that ∇_w ψ(y*; w) ∈ ∂f(w). That is, the max oracle can be implemented by inference.
The output of the exp oracle satisfies ∇f_{μω_{−H}}(w) = Σ_{y ∈ Y} p_y ∇_w ψ(y; w), where

    p_y = exp(ψ(y; w)/μ) / Σ_{y′ ∈ Y} exp(ψ(y′; w)/μ).

The output of the top-K oracle satisfies ∇f_{μω_{ℓ2²}}^K(w) = Σ_{y ∈ Y_K} q_y ∇_w ψ(y; w) for some weights (q_y) in the simplex, where Y_K = {y_{(1)}, …, y_{(K)}} is the set of K largest scoring outputs, satisfying

    ψ(y_{(1)}; w) ≥ ⋯ ≥ ψ(y_{(K)}; w) ≥ ψ(y; w)  for all y ∈ Y ∖ Y_K.
Part 2 deals with the composition of differentiable functions, and follows from the chain rule. Part 3 follows from the definition in Eq. (11). The proof of Part 1 follows from the chain rule for Fréchet subdifferentials of compositions (Rockafellar and Wets, 2009, Theorem 10.6) together with the fact that, by convexity and Danskin’s theorem (Bertsekas, 1999, Proposition B.25), the subdifferential of the max function is given by ∂max(z) = conv{ e_i : z_i = max(z) }. ∎
Consider the task of sequence tagging from Example 5. The inference problem (3) is a search over all label sequences. For chain graphs, this is equivalent to searching for the shortest path in the associated trellis, shown in Fig. 1. An efficient dynamic programming approach called the Viterbi algorithm (Viterbi, 1967) can solve this problem in space and time polynomial in the sequence length and the number of labels. The structural hinge loss is non-smooth because a small change in w might lead to a radical change in the best scoring path shown in Fig. 1.
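The dynamic program can be sketched as follows (array conventions are illustrative: unary holds the node scores and pairwise the edge scores of the augmented score function):

```python
import numpy as np

def viterbi(unary, pairwise):
    """Highest-scoring label sequence in a chain model.

    unary: (T, S) node scores; pairwise: (S, S) transition scores.
    Runs in O(T * S^2) time, versus the S^T cost of exhaustive search."""
    T, S = unary.shape
    score = unary[0].copy()                  # best scores of length-1 prefixes
    back = np.zeros((T, S), dtype=int)       # back-pointers
    for t in range(1, T):
        cand = score[:, None] + pairwise + unary[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]             # best final state
    for t in range(T - 1, 0, -1):            # follow back-pointers
        path.append(int(back[t][path[-1]]))
    return float(score.max()), path[::-1]
```

On small instances, the result can be checked against exhaustive enumeration of all label sequences.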
When smoothing with ℓ2², the gradient of the smoothed function is given by a projection onto the simplex, which picks out some number of the highest scoring outputs, or equivalently, shortest paths in the Viterbi trellis (Fig. 1(b)). The top-K oracle then uses the top-K strategy to approximate f_{μω_{ℓ2²}} with f_{μω_{ℓ2²}}^K.
3.2.2 Exp Oracles and Conditional Random Fields
Recall that a Conditional Random Field (CRF) (Lafferty et al., 2001) with augmented score function ψ and parameters w is a probabilistic model that assigns to an output y the probability

    P(y; w) = exp( ψ(y; w) − A(w) ),

where A(w) := log Σ_{y ∈ Y} exp(ψ(y; w)) is known as the log-partition function, a normalizer so that the probabilities sum to one. Gradient-based maximum likelihood learning algorithms for CRFs require computation of the log-partition function A(w) and its gradient ∇A(w). The next proposition relates the computational costs of the exp oracle and the log-partition function.
The exp oracle for an augmented score function ψ with smoothing parameter μ is equivalent in hardness to computing the log-partition function and its gradient for a conditional random field with augmented score function ψ/μ.
Fix a smoothing parameter μ > 0. Consider a CRF with augmented score function ψ/μ. Its log-partition function A satisfies f_{μω_{−H}}(w) = μ A(w). The claim now follows from the bijection between ∇f_{μω_{−H}}(w) and ∇A(w) given by the chain rule. ∎
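For the chain model of Example 5, this equivalence is transparent in code: replacing max by log-sum-exp in the Viterbi recursion yields the forward algorithm for the log-partition function. A sketch under illustrative array conventions (unary and pairwise hold the node and edge scores):

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log-sum-exp along an axis."""
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis) + np.log(np.exp(a - m).sum(axis=axis))

def log_partition(unary, pairwise):
    """log Z of a chain CRF via the forward algorithm: the Viterbi dynamic
    program with max replaced by log-sum-exp, run in log space for stability."""
    T, S = unary.shape
    alpha = unary[0].copy()
    for t in range(1, T):
        alpha = logsumexp(alpha[:, None] + pairwise, axis=0) + unary[t]
    return float(logsumexp(alpha, axis=0))
```

On small instances, the result matches the brute-force sum over all label sequences.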
4 Implementation of Inference Oracles
We now turn to the concrete implementation of the inference oracles. This depends crucially on the structure of the graph G. If the graph is a tree, then the inference oracles can be computed exactly with efficient procedures, as we shall see in Sec. 4.1. When the graph is not a tree, we study special cases in which specific discrete structure can be exploited to efficiently implement some of the inference oracles (Sec. 4.2). The results of this section are summarized in Table 2.
(Table 2 columns: max oracle, top-K oracle, exp oracle.)
Throughout this section, we fix an input-output pair (x_i, y_i) and consider the augmented score function ψ it defines, where the index of the sample is dropped for convenience. From (13) and the decomposability of the task loss, we get that ψ decomposes along the nodes and edges of G as:

    ψ(y; w) = Σ_{v ∈ V} ψ_v(y_v; w) + Σ_{(u,v) ∈ E} ψ_{u,v}(y_u, y_v; w).

When w is clear from the context, we denote ψ(y; w) by ψ(y). Likewise for ψ_v and ψ_{u,v}.
4.1 Inference Oracles in Trees
We first consider algorithms implementing the inference oracles in trees and examine their computational complexity.
4.1.1 Implementation of Inference Oracles
In tree structured graphical models, the inference problem (3), and thus the max oracle (cf. Lemma 7), can always be solved exactly in polynomial time by the max-product algorithm (Pearl, 1988), which uses the technique of dynamic programming (Bellman, 1957). The Viterbi algorithm (Algo. 1) for chain graphs from Example 8 is a special case. See Algo. 7 in Appendix B for the max-product algorithm in full generality.
The top-K oracle uses a generalization of the max-product algorithm that we name the top-K max-product algorithm. Following the work of Seroussi and Golmard (1994), it keeps track of the K best intermediate structures, while the max-product algorithm tracks only the single best intermediate structure. Formally, the kth largest element of a finite collection {h(s) : s ∈ S} is defined as the value in position k when the collection is sorted in non-increasing order, with ties broken arbitrarily.
We present the algorithm in the simple case of chain structured graphical models in Algo. 2. The top-K max-product algorithm for general trees is given in Algo. 8 in Appendix B. Note that it requires K times the time and space of the max oracle.
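The chain-structured case can be sketched as follows (a list-based implementation for clarity, with illustrative names, not the paper's Algo. 2). The recursion is exact, not a heuristic beam search, because every prefix of one of the K best sequences must itself be among the K best prefixes ending in the same state:

```python
import numpy as np

def top_k_viterbi(unary, pairwise, K):
    """K highest-scoring label sequences in a chain model, returned as
    (score, path) pairs in non-increasing score order. Requires roughly
    K times the time and space of the plain Viterbi algorithm."""
    T, S = unary.shape
    # beams[s]: up to K best (score, path) pairs over prefixes ending at s
    beams = [[(float(unary[0, s]), [s])] for s in range(S)]
    for t in range(1, T):
        new_beams = []
        for s in range(S):
            cand = [(score + pairwise[p, s] + unary[t, s], path + [s])
                    for p in range(S) for score, path in beams[p]]
            cand.sort(key=lambda c: -c[0])
            new_beams.append(cand[:K])       # keep the K best per state
        beams = new_beams
    final = sorted((c for b in beams for c in b), key=lambda c: -c[0])
    return final[:K]
```

On small instances, the K returned scores match the K largest scores found by exhaustive enumeration.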