LP-SparseMAP: Differentiable Relaxed Optimization for Sparse Structured Prediction

Abstract

Structured prediction requires manipulating a large number of combinatorial structures, e.g., dependency trees or alignments, either as latent or output variables. Recently, the SparseMAP method has been proposed as a differentiable, sparse alternative to maximum a posteriori (MAP) and marginal inference. SparseMAP returns a combination of a small number of structures, a desirable property in some downstream applications. However, SparseMAP requires a tractable MAP inference oracle. This excludes, e.g., loopy graphical models or factor graphs with logic constraints, which generally require approximate inference. In this paper, we introduce LP-SparseMAP, an extension of SparseMAP that addresses this limitation via a local polytope relaxation. LP-SparseMAP uses the flexible and powerful domain specific language of factor graphs for defining and backpropagating through arbitrary hidden structure, supporting coarse decompositions, hard logic constraints, and higher-order correlations. We derive the forward and backward algorithms needed for using LP-SparseMAP as a hidden or output layer. Experiments in three structured prediction tasks show benefits compared to SparseMAP and Structured SVM.

1 Introduction

Figure 1: Parsing model with valency constraints: each ‘‘head’’ word is constrained to have at most a budgeted number of ‘‘modifiers’’. LP-SparseMAP is the first method for tractable, differentiable decoding in such a model.

The data processed by machine learning systems often has underlying structure: for instance, language data has inter-word dependency trees or alignments, while image data can have meaningful segmentations. As downstream models benefit from the hidden structure, practitioners typically resort to pipelines, training a structure predictor on labelled data and using its output as features. This approach requires annotation, suffers from error propagation, and does not allow the structure predictor to adapt to the downstream task.

Instead, a promising direction is to treat structure as latent, or hidden: learning a structure predictor without supervision, together with the downstream model in an end-to-end fashion. Several recent approaches have been proposed to tackle this, based on differentiating through marginal inference (Kim et al., 2017; Liu and Lapata, 2018), on noisy gradient estimates (Peng et al., 2018; Yogatama et al., 2017), or both (Corro and Titov, 2019a, b). These approaches require specialized, structure-specific algorithms either for computing gradients or for sampling, limiting the choice of the practitioner to a catalogue of supported types of structure. A slightly more general approach is SparseMAP (Niculae et al., 2018), which is differentiable and outputs combinations of a small number of structures, requiring only an algorithm for MAP. However, it is often desirable to increase the expressiveness of structured models with logic constraints or higher-order interactions. This complicates the search space and typically makes exact maximization intractable. For example, adding constraints on the depth of a parse tree typically makes the problem NP-hard. We relax these stringent limitations and improve practitioners’ modeling freedom through the following contributions:

  • We propose a generic method for differentiable structured hidden layers, based on the flexible domain-specific language of factor graphs, familiar to many structured prediction practitioners.

  • We derive an efficient and globally-convergent ADMM algorithm for the forward pass.

  • We prove a compact, efficient form for the backward pass, reusing quantities precomputed in the forward pass and avoiding the need to unroll a computation graph.

  • Our overall method is modular: new factor types can be added to our toolkit just by providing a MAP oracle or, if available, specialized SparseMAP forward and backward functions.

  • We derive the specialized computation described above for core building block factors such as pairwise, logical OR, negation, budget constraints, etc., ensuring our toolkit is expressive out-of-the-box.

We show empirical improvements on inducing latent trees on arithmetic expressions, bidirectional alignments in natural language inference, and multilabel classification. Our library is available at https://github.com/deep-spin/lp-sparsemap.

2 Background

2.1 Notation

We denote scalars, vectors, and matrices as $a$, $\boldsymbol{a}$, and $\boldsymbol{A}$, respectively. The set of indices $\{1, \dots, d\}$ is denoted $[d]$. The $i$th column of a matrix $\boldsymbol{A}$ is $\boldsymbol{a}_i$. The canonical simplex is $\triangle^d := \{\boldsymbol{p} \in \mathbb{R}^d : \boldsymbol{p} \geq \boldsymbol{0},\ \sum_i p_i = 1\}$, and the convex hull of a set $S$ is $\operatorname{conv}(S)$. We denote the row-wise stacking of $\boldsymbol{A}$ and $\boldsymbol{B}$ as $[\boldsymbol{A}; \boldsymbol{B}]$; in particular, $[\boldsymbol{a}; \boldsymbol{b}]$ is the concatenation of two (column) vectors. Given a vector $\boldsymbol{v}$, $\operatorname{diag}(\boldsymbol{v})$ is the diagonal matrix with $\boldsymbol{v}$ along the diagonal. Given matrices $\boldsymbol{A}_1, \dots, \boldsymbol{A}_k$ of arbitrary dimensions, $\operatorname{diag}(\boldsymbol{A}_1, \dots, \boldsymbol{A}_k)$ denotes the corresponding block-diagonal matrix.

2.2 Tractable structured problems

Structured prediction involves searching for valid structures over a large, combinatorial space $\mathcal{Y}$. We assign a vector representation $\boldsymbol{a}_y \in \mathbb{R}^D$ to each structure $y \in \mathcal{Y}$. For instance, we may consider structures to be joint assignments of $D$ binary variables (corresponding to parts of the structure) and define $[\boldsymbol{a}_y]_i = 1$ if variable $i$ is turned on in structure $y$, else $[\boldsymbol{a}_y]_i = 0$. The set of valid structures is typically non-trivial. For example, in matching problems between $n$ workers and $n$ tasks, we have $n^2$ binary variables, but the only legal assignments give exactly one task to each worker, and one worker to each task.
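To make the vector encoding concrete, the snippet below (an illustration of ours, not taken from the paper's codebase) enumerates the valid structures of a small worker-task matching and stacks their indicator vectors as the columns of $\boldsymbol{A}$.

import itertools
import numpy as np

n = 3
structures = []
for perm in itertools.permutations(range(n)):      # all 3! = 6 valid matchings
    a = np.zeros((n, n))
    a[np.arange(n), perm] = 1.0                     # worker i -> task perm[i]
    structures.append(a.ravel())                    # vector representation a_y
A = np.stack(structures, axis=1)                    # columns of A, one per structure

eta = np.random.randn(n * n)                        # scores over parts (arcs)
scores = A.T @ eta                                  # theta_y = a_y^T eta for each y
print("MAP structure:\n", structures[int(np.argmax(scores))].reshape(n, n))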

Maximization (MAP). Given a score vector $\boldsymbol{\eta} \in \mathbb{R}^D$ over parts, we assign a score $\theta_y := \boldsymbol{a}_y^\top \boldsymbol{\eta}$ to each structure. Assembling all $\boldsymbol{a}_y$ as columns of a matrix $\boldsymbol{A}$, finding the highest-scoring structure amounts to

$$\max_{y \in \mathcal{Y}} \theta_y \;=\; \max_{\boldsymbol{\mu} \in \mathcal{M}} \boldsymbol{\eta}^\top \boldsymbol{\mu}, \qquad \text{where } \mathcal{M} := \operatorname{conv}\{\boldsymbol{a}_y : y \in \mathcal{Y}\}. \tag{1}$$

$\mathcal{M}$ is called the marginal polytope (Wainwright and Jordan, 2008), and its points $\boldsymbol{\mu} = \boldsymbol{A}\boldsymbol{p}$ are expectations of $\boldsymbol{a}_y$ under some distribution $\boldsymbol{p} \in \triangle^{|\mathcal{Y}|}$.

In the sequel, we split $\boldsymbol{\mu} = [\boldsymbol{\mu}_V; \boldsymbol{\mu}_F]$ such that $\boldsymbol{\mu}_V$ is the output of interest (e.g., variable assignments, sometimes called unaries), while $\boldsymbol{\mu}_F$ captures additional structures or interactions (e.g., transitions in sequence tagging). This distinction is not essential, as we may always take $\boldsymbol{\mu}_F$ to be empty (i.e., treat additional interactions as first-class variables), but it is more consistent with pairwise Markov Random Fields (MRF).

Optimization as a hidden layer. Consider viewing MAP as a function of the scores, breaking ties arbitrarily:

$$\hat{\boldsymbol{\mu}}(\boldsymbol{\eta}) := \arg\max_{\boldsymbol{\mu} \in \mathcal{M}} \boldsymbol{\eta}^\top \boldsymbol{\mu}. \tag{2}$$

Almost everywhere, small changes to $\boldsymbol{\eta}$ do not change the highest-scoring structure. Thus, on any locally-continuous piece of its domain, $\partial \hat{\boldsymbol{\mu}} / \partial \boldsymbol{\eta} = \boldsymbol{0}$, making MAP unsuitable as a hidden layer in a neural network trained with gradient-based optimization (Peng et al., 2018).

Marginal inference. For unstructured maximization (as seen, for instance, in attention mechanisms), it is common to replace the argmax with its relaxation, softmax. Denote the Shannon entropy of a distribution $\boldsymbol{p} \in \triangle^{|\mathcal{Y}|}$ by $H(\boldsymbol{p}) := -\sum_{y} p_y \log p_y$. The structured relaxation of MAP, analogous to softmax, is the entropy-regularized problem

$$\max_{\boldsymbol{p} \in \triangle^{|\mathcal{Y}|}} \; \sum_{y \in \mathcal{Y}} p_y \theta_y + H(\boldsymbol{p}), \tag{3}$$

whose solution is $p_y \propto \exp(\theta_y)$. This Gibbs distribution is dense, and it induces a marginal distribution over variable assignments (Wainwright and Jordan, 2008):

$$\boldsymbol{\mu} := \mathbb{E}_{\boldsymbol{p}}[\boldsymbol{a}_y] = \boldsymbol{A}\boldsymbol{p}. \tag{4}$$

While generally intractable, for certain models, such as sequence tagging, one can efficiently compute the marginals and the entropy (often with dynamic programming; Kim et al., 2017). In many others, marginal inference is intractable, e.g., for matching (Valiant, 1979; Taskar, 2004, Section 3.5) or dependency parsing with valency constraints (McDonald and Satta, 2007).
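For intuition, when $\mathcal{Y}$ is small enough to enumerate, the Gibbs distribution and marginals of Equations 3-4 can be computed by brute force. The sketch below (ours, for illustration only) does exactly that and highlights why the marginals are generally dense.

import numpy as np

def brute_force_marginals(A, eta):
    """A: (D, |Y|) matrix of structure columns; eta: (D,) part scores."""
    theta = A.T @ eta                      # score of each structure
    p = np.exp(theta - theta.max())
    p /= p.sum()                           # Gibbs distribution over structures
    mu = A @ p                             # marginals over parts: dense in general
    return p, mu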

SparseMAP (Niculae et al., 2018) is a differentiable middle ground between maximization and expectation. It is defined via the quadratic objective

$$\max_{\boldsymbol{\mu} \in \mathcal{M}} \; \boldsymbol{\eta}^\top \boldsymbol{\mu} - \tfrac{1}{2}\|\boldsymbol{\mu}_V\|^2, \tag{5}$$

where an optimal sparse distribution $\boldsymbol{p}^\star$ and the unique $\boldsymbol{\mu}^\star = \boldsymbol{A}\boldsymbol{p}^\star$ can be efficiently computed via the active set method (Nocedal and Wright, 1999, Ch. 16.4 & 16.5), a generalization of Wolfe’s min-norm point method (Wolfe, 1976) and an instance of conditional gradient (Frank and Wolfe, 1956). Remarkably, the active set method only requires calls to a maximization oracle (i.e., finding the highest-scoring structure repeatedly, after adjustments), and it has linear, finite convergence. This means SparseMAP can be computed efficiently even for structures where marginal inference is not available, potentially turning any structured problem with an available maximization algorithm into a differentiable, sparse structured hidden layer. The sparsity not only brings computational advantages, but also aids visualization and interpretation.

However, the requirement of an exact maximization algorithm is still a rather stringent limitation. In the remainder of the section, we look into a flexible family of structured models where maximization is hard. Then, we extend SparseMAP to cover all such models.

Figure 2: Matching model under two equivalent decompositions. Left: a coarse one with a single factor. Right: a fine one with multiple XOR factors.

2.3 Intractable structured problems and factor graph representations

We now turn to more complicated structured problems, consisting of multiple interacting subproblems. As we shall see, this covers many interesting problems.

Essentially, we represent the global structure as a joint assignment to binary variables, and posit a decomposition of the problem into local factors $f \in \mathcal{F}$, each encoding locally-tractable scoring and constraints (Kschischang et al., 2001). A factor may be seen as a smaller structured subproblem. Crucially, on the variables where multiple factors overlap, they must agree, rendering the subproblems interdependent and non-separable.

Examples. Figure 1 shows a factor graph for a dependency parsing problem in which prior knowledge dictates valency constraints, i.e., disallowing words from being assigned more than a budgeted number of dependent modifiers. This encourages depth, preventing trees from being too flat. For a sentence with $n$ words, we use one binary variable for every possible arc (including the root arcs, omitted in the figure). The global TREE factor disallows assignments that are not trees, and the BUDGET constraint factors, each governing the outgoing arcs of a different head word, disallow more than the budgeted number of dependency arcs out of each word. Factor graph representations are often not unique. For instance, consider a matching (linear assignment) model (Figure 2). We may employ a coarse factorization consisting of a single matching factor, for which maximization is tractable thanks to the Kuhn-Munkres algorithm (Kuhn, 1955). The same problem can also be represented using multiple XOR factors, constraining each row and each column to have exactly (exclusively) one selected variable.
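As a concrete illustration of the fine decomposition in Figure 2 (our own bookkeeping, not the library's API), each factor can be recorded simply as the list of global variable indices it governs:

import numpy as np

n = 3
var_index = np.arange(n * n).reshape(n, n)          # variable id of arc (i, j)

factors = []
for i in range(n):                                   # XOR over each row:
    factors.append(("XOR", list(var_index[i, :])))   # worker i takes exactly one task
for j in range(n):                                   # XOR over each column:
    factors.append(("XOR", list(var_index[:, j])))   # task j taken by exactly one worker

for name, idx in factors:
    print(name, idx)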

To be formal, denote the global variable assignments as $\boldsymbol{\mu}_V \in \{0, 1\}^D$. For each factor $f$, we encode its legal local assignments as columns of a matrix $\boldsymbol{A}_f$, and define a selector matrix $\boldsymbol{C}_f$ such that $\boldsymbol{C}_f \boldsymbol{\mu}_V$ ‘‘selects’’ the part of $\boldsymbol{\mu}_V$ covered by the factor $f$. Then, a valid global assignment can be represented as a tuple of local assignments $(\boldsymbol{\mu}_f)_{f \in \mathcal{F}}$, each $\boldsymbol{\mu}_f$ a column of $\boldsymbol{A}_f$, provided that the agreement constraints are satisfied:

$$\boldsymbol{\mu}_f = \boldsymbol{C}_f \boldsymbol{\mu}_V \quad \text{for all } f \in \mathcal{F}. \tag{6}$$

Finding the highest-scoring structure has the same form as in the tractable case, but the discrete agreement constraints in (6) make it difficult to compute, even when each factor is simple:

$$\max_{\boldsymbol{\mu}_V,\ (\boldsymbol{\mu}_f)_f} \; \sum_{f \in \mathcal{F}} \boldsymbol{\eta}_f^\top \boldsymbol{\mu}_f \quad \text{s.t. each } \boldsymbol{\mu}_f \text{ is a column of } \boldsymbol{A}_f \text{ and } \boldsymbol{\mu}_f = \boldsymbol{C}_f \boldsymbol{\mu}_V. \tag{7}$$

For compactness, consider the concatenations $\boldsymbol{\mu}_{\mathcal{F}} := [\boldsymbol{\mu}_{f_1}; \dots; \boldsymbol{\mu}_{f_{|\mathcal{F}|}}]$ and $\boldsymbol{\eta}_{\mathcal{F}} := [\boldsymbol{\eta}_{f_1}; \dots; \boldsymbol{\eta}_{f_{|\mathcal{F}|}}]$, the stacked selector matrix $\boldsymbol{C} := [\boldsymbol{C}_{f_1}; \dots; \boldsymbol{C}_{f_{|\mathcal{F}|}}]$, and the block-diagonal matrix $\boldsymbol{A} := \operatorname{diag}(\boldsymbol{A}_{f_1}, \dots, \boldsymbol{A}_{f_{|\mathcal{F}|}})$. We may then write the optimization problem

$$\max_{\boldsymbol{\mu}_V,\ \boldsymbol{\mu}_f \in \mathcal{M}_f} \; \sum_{f \in \mathcal{F}} \boldsymbol{\eta}_f^\top \boldsymbol{\mu}_f \quad \text{s.t.}\ \boldsymbol{\mu}_f = \boldsymbol{C}_f \boldsymbol{\mu}_V \ \text{for all } f \in \mathcal{F}, \tag{8}$$

continuously relaxing each factor independently (replacing its vertex set by its convex hull $\mathcal{M}_f := \operatorname{conv}(\text{columns of } \boldsymbol{A}_f)$) while enforcing agreement. The objective in Equation 8 is separable, but the constraints are not. The feasible set,

$$\mathcal{L} := \big\{ (\boldsymbol{\mu}_V, \boldsymbol{\mu}_{\mathcal{F}}) : \boldsymbol{\mu}_f \in \mathcal{M}_f,\ \boldsymbol{\mu}_f = \boldsymbol{C}_f \boldsymbol{\mu}_V \ \text{for all } f \in \mathcal{F} \big\}, \tag{9}$$

is called the local polytope; it is an outer approximation of the marginal polytope $\mathcal{M}$. Therefore, (8) is a relaxation of (7), known as LP-MAP (Wainwright and Jordan, 2008). In general, the inclusion is strict. Many LP-MAP algorithms exploiting the graphical model structure have been proposed, from the perspective of message passing or dual decomposition (Wainwright et al., 2005; Kolmogorov, 2006; Komodakis et al., 2007; Globerson and Jaakkola, 2007; Koo et al., 2010). In particular, AD3 (Martins et al., 2015) tackles LP-MAP by solving a SparseMAP-like quadratic subproblem for each factor. In the next section, we use this connection to extend AD3 to a smoothed objective, resulting in a general algorithm for sparse differentiable inference.

3 LP-SparseMAP

By analogy to Equation 5, we propose the differentiable LP-SparseMAP inference strategy:

$$\max_{(\boldsymbol{\mu}_V,\ \boldsymbol{\mu}_{\mathcal{F}}) \in \mathcal{L}} \; \sum_{f \in \mathcal{F}} \boldsymbol{\eta}_f^\top \boldsymbol{\mu}_f \;-\; \tfrac{1}{2}\|\boldsymbol{\mu}_V\|^2. \tag{10}$$

Unlike LP-MAP (Equation 8), LP-SparseMAP has a non-separable term in the objective. Separating it requires nontrivial accounting for variables appearing in multiple subproblems. We tackle this in the next proposition, reformulating Equation 10 as consensus optimization.

Proposition 1.

Denote by $\delta_i$ the number of factors governing variable $i$.¹ Define $\boldsymbol{\delta}_f$ as the restriction of $\boldsymbol{\delta} := (\delta_1, \dots, \delta_D)$ to the variables covered by factor $f$, and $\boldsymbol{D}_f := \operatorname{diag}(\boldsymbol{\delta}_f)$. Denote $\boldsymbol{D} := \operatorname{diag}(\boldsymbol{\delta})$. Then, the problem below is equivalent to (10):

$$\max_{\boldsymbol{\mu}_V,\ \boldsymbol{\mu}_f \in \mathcal{M}_f} \; \sum_{f \in \mathcal{F}} \Big( \boldsymbol{\eta}_f^\top \boldsymbol{\mu}_f - \tfrac{1}{2} \big\|\boldsymbol{D}_f^{-1/2} \boldsymbol{\mu}_f\big\|^2 \Big) \tag{11}$$

subject to $\boldsymbol{\mu}_f = \boldsymbol{C}_f \boldsymbol{\mu}_V$ for all $f \in \mathcal{F}$.
Proof.

Under the agreement constraints, the global variables are determined by the local copies: left-multiplying the stacked constraint $\boldsymbol{C}\boldsymbol{\mu}_V = \boldsymbol{\mu}_{\mathcal{F}}$ by $\boldsymbol{C}^\top$ gives $\boldsymbol{\mu}_V = \boldsymbol{D}^{-1}\boldsymbol{C}^\top\boldsymbol{\mu}_{\mathcal{F}}$, since every variable is covered by at least one factor and hence $\boldsymbol{C}^\top\boldsymbol{C} = \boldsymbol{D}$ is invertible. It remains to show that, at feasibility, $\sum_{f \in \mathcal{F}} \|\boldsymbol{D}_f^{-1/2}\boldsymbol{\mu}_f\|^2 = \|\boldsymbol{\mu}_V\|^2$; this follows from $\sum_{f} \sum_{i \in f} \mu_i^2/\delta_i = \sum_i \mu_i^2$ (shown in Appendix A). ∎

3.1 Forward pass

Using this reformulation, we are now ready to introduce an ADMM algorithm (Glowinski and Marroco, 1975; Gabay and Mercier, 1976; Boyd et al., 2011) for maximizing Equation 11. The method is given in Algorithm 1 and derived in Appendix B. Like AD3, each iteration alternates between:

  1. solving a SparseMAP subproblem for each factor; (With the active set algorithm, this requires only cheap calls to a MAP oracle.)

  2. enforcing global agreement by averaging;

  3. performing a gradient update on the dual variables.

Proposition 2.

Algorithm 1 converges to a solution of (10); moreover, the number of iterations needed to reach a dual suboptimality of $\epsilon$ is $O(1/\epsilon)$.

Proof.

The algorithm is an instantiation of ADMM applied to Equation 11, inheriting the convergence guarantees of ADMM (Boyd et al., 2011, Appendix A). From Proposition 1, this problem is equivalent to (10). Finally, the rate of convergence is established by Martins et al. (2015, Proposition 8), as the two problems differ only through an additional regularization term in the objective. ∎

When there is a single factor, i.e., $|\mathcal{F}| = 1$, the algorithm converges in a single outer iteration. In this case, since $\delta_i = 1$ for all $i$, we recover SparseMAP exactly.

3.2 Backward pass

Unlike marginal inference, LP-SparseMAP encourages the local distribution at each factor to become sparse. This results in a simple form for the LP-SparseMAP Jacobian, defined in terms of the local SparseMAP Jacobians of each factor (Appendix C.1). Denote the local solutions and the Jacobians of the SparseMAP subproblem for each factor $f$ as

$$\boldsymbol{\mu}_f^\star \quad \text{and} \quad \boldsymbol{J}_f := \frac{\partial \boldsymbol{\mu}_f^\star}{\partial \boldsymbol{\eta}_f}. \tag{12}$$

When using the active set algorithm for SparseMAP, the $\boldsymbol{J}_f$ are precomputed in the forward pass (Niculae et al., 2018). The LP-SparseMAP backward pass combines the local Jacobians while taking into account the agreement constraints, as shown next.


Algorithm 1 ADMM for LP-SparseMAP
1: Input: $\boldsymbol{\eta}$ (scores), $T$ (max. iterations), $\gamma$ (ADMM step size), $\epsilon_p, \epsilon_d$ (primal and dual stopping criteria).
2: Output: $(\boldsymbol{\mu}_V, \boldsymbol{\mu}_{\mathcal{F}})$ solving Equation 10.
3: Initialization: uniform local distributions; zero dual variables $\boldsymbol{\lambda}_f$.
4: for $t = 1, \dots, T$ do
5:   for all $f \in \mathcal{F}$ do   (SparseMAP subproblem)
6:     form the adjusted local scores from $\boldsymbol{\eta}_f$, $\boldsymbol{\lambda}_f$, and the current $\boldsymbol{\mu}_V$ (Equation 28)
7:     solve the local SparseMAP subproblem, e.g., with the active set method,
8:     obtaining the local solution $\boldsymbol{\mu}_f$
9:   update $\boldsymbol{\mu}_V$ by averaging the local copies of each variable   (agreement by local averaging)
10:  take a gradient step of size $\gamma$ on the dual variables $\boldsymbol{\lambda}_f$   (dual update)
11:  if primal residual $\leq \epsilon_p$ & dual residual $\leq \epsilon_d$ then return   (converged)
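For readers who prefer code, here is a minimal consensus-ADMM sketch in the scaled dual form of Boyd et al. (2011), following the same solve/average/dual-update skeleton as Algorithm 1. It is only a structural approximation of the paper's algorithm: the SparseMAP regularizer and the degree corrections of Equation 28 are assumed to be folded into the user-supplied local_solve callback, and the step size is fixed.

import numpy as np

def lp_sparsemap_admm(num_vars, factors, local_solve, rho=1.0,
                      n_iter=100, tol=1e-6):
    """factors: list of (f_id, idx) pairs, idx being the global variable indices
    governed by that factor; local_solve(f_id, target, rho) is assumed to return
    argmin_{mu_f in M_f} rho/2 ||mu_f - target||^2 - eta_f^T mu_f."""
    z = np.zeros(num_vars)                               # global variable estimates
    mu = {f: np.zeros(len(idx)) for f, idx in factors}   # local copies
    u = {f: np.zeros(len(idx)) for f, idx in factors}    # scaled dual variables
    deg = np.zeros(num_vars)
    for _, idx in factors:
        deg[idx] += 1.0                                  # factor count per variable

    for _ in range(n_iter):
        for f, idx in factors:                           # 1) local subproblems
            mu[f] = local_solve(f, z[idx] - u[f], rho)
        z_new = np.zeros(num_vars)                       # 2) agreement by averaging
        for f, idx in factors:
            z_new[idx] += mu[f] + u[f]
        z_new = z_new / np.maximum(deg, 1.0)
        for f, idx in factors:                           # 3) dual update
            u[f] = u[f] + (mu[f] - z_new[idx])
        primal = max(np.max(np.abs(mu[f] - z_new[idx])) for f, idx in factors)
        dual = np.max(np.abs(z_new - z))
        z = z_new
        if primal < tol and dual < tol:                  # both residuals small
            break
    return z, mu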
Proposition 3.

Let $\boldsymbol{J}_V$ and $\boldsymbol{J}_F$ denote the block-diagonal matrices collecting the local SparseMAP Jacobians with respect to the unary and additional scores, respectively. Consider the fixed point

(13)
(14)

The proof is given in Appendix C.2; the fixed point may be computed using an eigensolver. However, to use LP-SparseMAP as a hidden layer, we do not need materialized Jacobians, only access to Jacobian-vector products.

These can be computed iteratively by Algorithm 2. Since the $\boldsymbol{C}_f$ are highly sparse, structured selector matrices, lines 5 and 7 are simple indexing operations followed by scaling; the bulk of the computation is line 6, which can be seen as invoking the backward pass of each factor, as if that factor were alone in the graph. The structure of Algorithm 2 is similar to that of Algorithm 1; however, our backward pass is much more efficient than ‘‘unrolling’’ Algorithm 1 within a computation graph: it only requires access to the final state of the ADMM solver (Algorithm 1), rather than all intermediate states, as unrolling would.

3.3 Implementation and specializations

The forward and backward passes of LP-SparseMAP, described above, are appealing from the perspective of modular implementation. The outer loop interacts with each factor through only two interfaces: a SolveSparseMAP function and a JacobianTimesVector function. In turn, both methods can be implemented in terms of a SolveMAP maximization oracle (Niculae et al., 2018).


Algorithm 2 Backward pass for LP-SparseMAP
1: Input: the gradient of the loss w.r.t. the LP-SparseMAP output, $T$ (the maximum number of iterations), $\epsilon$ (stopping criterion).
2: Output: the loss gradients w.r.t. $\boldsymbol{\eta}_V$ and $\boldsymbol{\eta}_F$.
3: for $t = 1, \dots, T$ do
4:   for all $f \in \mathcal{F}$ do
5:     split the current iterate into copies for each factor
6:     apply the local SparseMAP Jacobian of factor $f$   (local backward)
7:   recombine the factor contributions   (local averaging)
8:   if the change is below $\epsilon$ then
9:     return   (converged)
10:  else continue
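Whatever the implementation, the backward pass is easy to sanity-check: since the layer output is piecewise linear in the scores, a central finite-difference estimate of the directional derivative should match the implemented vector-Jacobian product almost everywhere. The helper below is ours; the forward and backward callables are assumptions, not part of the paper's interface.

import numpy as np

def check_backward(forward, backward, eta, dy, eps=1e-4, seed=0):
    """forward(eta) -> mu_V; backward(eta, dy) -> gradient of <dy, mu_V> w.r.t. eta."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(eta.shape)                   # random probe direction
    fd = (dy @ forward(eta + eps * v)
          - dy @ forward(eta - eps * v)) / (2 * eps)     # finite-difference estimate
    analytic = backward(eta, dy) @ v                     # implemented vector-Jacobian product
    return abs(fd - analytic)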

For certain factors, such as the logic constraints in Table 1, faster direct implementations of SolveSparseMAP and JacobianTimesVector are available, and our algorithm easily allows specialization. This is appealing from a testing perspective, as the specializations must agree with the generic implementation.

For example, the exclusive-or (XOR) factor requires that exactly one out of $d$ variables be on. Its marginal polytope is the convex hull of the $d$ one-hot assignments, i.e., the simplex $\triangle^d$. The required SparseMAP subproblem with degree corrections is

$$\max_{\boldsymbol{\mu}_f \in \triangle^d} \; \boldsymbol{\eta}_f^\top \boldsymbol{\mu}_f - \tfrac{1}{2}\big\|\boldsymbol{D}_f^{-1/2}\boldsymbol{\mu}_f\big\|^2. \tag{15}$$

When all degrees equal one, this is a projection onto the simplex (sparsemax), and efficient algorithms are known for its forward pass and backward pass (Martins and Astudillo, 2016). For general degrees, the algorithm of Pardalos and Kovoor (1990) applies, and the backward pass involves a generalization of the sparsemax Jacobian.
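For reference, the well-known sparsemax forward and backward computations (Martins and Astudillo, 2016) look as follows in numpy; this is the degree-one special case of the XOR subproblem, without the corrections in Equation 15.

import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex."""
    z_sorted = np.sort(z)[::-1]
    cssv = np.cumsum(z_sorted) - 1.0
    rho = np.arange(1, len(z) + 1)
    support = z_sorted - cssv / rho > 0
    k = rho[support][-1]                    # size of the support
    tau = cssv[support][-1] / k
    return np.maximum(z - tau, 0.0)

def sparsemax_jvp(p, v):
    """Multiply the sparsemax Jacobian at solution p with a vector v."""
    s = p > 0                               # support of the solution
    v_hat = np.where(s, v, 0.0)
    return v_hat - s * (v_hat.sum() / s.sum())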

In Appendix D, we derive specialized forward and backward passes for XOR, and the constraint factors in Table 1, as well as for negated variables, OR, OR-Output, Knapsack and pairwise (Ising) factors.

name                 constraint
XOR (exactly one)    $\sum_{i=1}^{d} \mu_i = 1$
AtMostOne            $\sum_{i=1}^{d} \mu_i \leq 1$
OR                   $\sum_{i=1}^{d} \mu_i \geq 1$
BUDGET               $\sum_{i=1}^{d} \mu_i \leq B$
Knapsack             $\sum_{i=1}^{d} c_i \mu_i \leq B$
OROut                $\mu_d = \mu_1 \vee \dots \vee \mu_{d-1}$
Table 1: Examples of logic constraint factors.
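For concreteness, MAP oracles for several of these factors reduce to simple selection rules. The sketches below are ours and ignore degree corrections and tie-breaking conventions; each returns a 0/1 assignment over the factor's local variables.

import numpy as np

def map_xor(scores):                 # exactly one variable on
    y = np.zeros_like(scores)
    y[np.argmax(scores)] = 1.0
    return y

def map_at_most_one(scores):         # at most one on: only if it helps
    y = np.zeros_like(scores)
    if scores.max() > 0:
        y[np.argmax(scores)] = 1.0
    return y

def map_or(scores):                  # at least one on
    y = (scores > 0).astype(float)
    if y.sum() == 0:
        y[np.argmax(scores)] = 1.0   # forced: turn on the least harmful one
    return y

def map_budget(scores, B):           # at most B on
    y = np.zeros_like(scores)
    top = np.argsort(-scores)[:B]
    keep = top[scores[top] > 0]      # keep only positively-scored variables
    y[keep] = 1.0
    return y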

4 LP-SparseMAP loss for structured outputs

So far, we described LP-SparseMAP for structured hidden layers. When supervision is available, either as a downstream objective or as partial supervision over latent structures, there is a natural convex loss relaxing the SparseMAP loss (Niculae et al., 2018):

$$L(\boldsymbol{\eta}; \bar{\boldsymbol{\mu}}) := \max_{\boldsymbol{\mu}} \Big( \sum_{f \in \mathcal{F}} \boldsymbol{\eta}_f^\top \boldsymbol{\mu}_f - \tfrac{1}{2}\|\boldsymbol{\mu}_V\|^2 \Big) - \Big( \sum_{f \in \mathcal{F}} \boldsymbol{\eta}_f^\top \bar{\boldsymbol{\mu}}_f - \tfrac{1}{2}\|\bar{\boldsymbol{\mu}}_V\|^2 \Big), \tag{16}$$

under the constraints of Equation 10, where $\bar{\boldsymbol{\mu}}$ denotes the gold assignment. Like the SparseMAP loss, this LP-SparseMAP loss falls into the recently-proposed class of Fenchel-Young losses (Blondel et al., 2019a); therefore, it is a well-behaved loss and, moreover, it naturally has a margin property (Blondel et al., 2019b, Proposition 8). Its gradients are obtained from the LP-SparseMAP solution $\hat{\boldsymbol{\mu}}$ as

$$\nabla_{\boldsymbol{\eta}_V} L = \hat{\boldsymbol{\mu}}_V - \bar{\boldsymbol{\mu}}_V, \tag{17}$$
$$\nabla_{\boldsymbol{\eta}_F} L = \hat{\boldsymbol{\mu}}_F - \bar{\boldsymbol{\mu}}_F. \tag{18}$$

When already using LP-SparseMAP as a hidden layer, this loss provides a natural way to incorporate supervision on the latent structure at no additional cost.
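In code, training with this loss needs nothing beyond the forward solver: for a Fenchel-Young loss, the gradient with respect to the scores is the difference between the predicted and the gold assignments, as in Equations 17-18. A minimal sketch (with an assumed solver callable) follows.

def lp_sparsemap_loss_grad(eta, gold_mu, solver):
    """solver(eta) is assumed to return the LP-SparseMAP assignment for eta;
    gold_mu is the gold assignment in the same vector encoding."""
    mu_hat = solver(eta)
    return mu_hat - gold_mu          # gradient of the loss w.r.t. eta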

5 Experiments

In this section, we demonstrate LP-SparseMAP for learning complex latent structures on both toy and real-world datasets, as well as on a structured output task. Learning hidden structures solely from a downstream objective is challenging for powerful models that can bypass the latent component entirely. For this reason, we design our experiments using simpler, smaller networks where the inferred structure is an un-bypassable bottleneck, ensuring the predictions depend on it. We use DyNet (Neubig et al., 2017) and list hyperparameter configurations and ranges in Appendix E.

5.1 ListOps valency tagging

Figure 3: Score over the course of training for tagging ListOps nodes with their valency, using a latent tree. Incorporating inductive bias via budget constraints improves performance.

The ListOps dataset (Nangia and Bowman, 2018) is a synthetic collection of bracketed expressions, such as [max 2 9 [min 4 7 ] 0 ]. The arguments are lists of integers, and the operators are set summarizers such as median, max, sum, etc. It was proposed as a litmus test for studying latent tree learning models, since the syntax is essential to the semantics. Instead of tackling the challenging task of learning to evaluate the expressions, we follow Corro and Titov (2019b) and study a tagging task: labeling each operator with the number of arguments it governs.

Model architecture. We encode the sequence with a BiLSTM, yielding vectors $\boldsymbol{h}_1, \dots, \boldsymbol{h}_n$. We compute the score of the dependency arc $h \to m$ as the dot product between the outputs of two learned mappings, one encoding the head and one the modifier, applied to $\boldsymbol{h}_h$ and $\boldsymbol{h}_m$, respectively.

We perform LP-SparseMAP optimization to get the sparse arc posterior probabilities, using the different factor graph structures described in the next paragraph:

$$\boldsymbol{\mu}_V = \operatorname{LP\text{-}SparseMAP}(\boldsymbol{\eta}). \tag{19}$$

The arc posteriors $\boldsymbol{\mu}_V$ correspond to a sparse combination of dependency trees. We perform one iteration of a Graph Convolutional Network (GCN) along the arcs, weighted by $\boldsymbol{\mu}_V$. Crucially, the input to the GCN is not the BiLSTM output but a ‘‘de-lexicalized’’ sequence in which every token is replaced by the same learned parameter vector, repeated $n$ times regardless of the tokens. This forces the predictions to rely on the GCN and thus on the latent trees, preventing the model from using the global BiLSTM to ‘‘cheat’’. The GCN produces contextualized representations, which we then pass through an output layer to predict the valency label for each operator node.
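A rough numpy sketch of the arc scorer and of one GCN step over the sparse arc posteriors is shown below; the parameter names and the exact nonlinearities are our assumptions, not the paper's.

import numpy as np

def arc_scores(H, W_head, W_mod):
    """H: (n, d) BiLSTM outputs; returns an (n, n) score matrix where
    entry (h, m) scores the arc head h -> modifier m."""
    heads = np.tanh(H @ W_head)               # head representations
    mods = np.tanh(H @ W_mod)                 # modifier representations
    return heads @ mods.T                     # dot-product arc scores

def gcn_step(arc_post, X, W_in, W_self):
    """arc_post: (n, n) sparse arc posteriors from LP-SparseMAP;
    X: (n, d) de-lexicalized inputs; one convolution along weighted arcs."""
    msg = arc_post.T @ (X @ W_in)             # each modifier aggregates its heads
    return np.maximum(msg + X @ W_self, 0.0)  # ReLU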

Factor graphs. Unlike Corro and Titov (2019b), who use projective dependency parsing, we consider the general non-projective case, making the problem more challenging. The MAP oracle is the maximum arborescence algorithm (Chu and Liu, 1965; Edmonds, 1967).

                 validation        test
                 Acc.              Acc.
left-to-right    28.14   17.54     28.07   17.43
tree             68.23   68.74     68.74   69.12
tree+budget      82.35   82.59     82.75   82.95
Table 2: ListOps tagging results with non-projective latent trees. The budget constraints bring improvement.

First, we consider a factor graph with a single non-projective TREE factor: in this case, LP-SparseMAP reduces to a SparseMAP baseline. Motivated by multiple observations that SparseMAP and similar latent structure learning methods tend to learn trivial trees (Williams et al., 2018), we next consider overlaying constraints in the form of BUDGET factors on top of the TREE factor. For every possible head word, we include a BUDGET factor allowing at most five of its possible outgoing arcs to be selected.
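Schematically (hypothetical bookkeeping, not the actual lp-sparsemap API), this factor graph couples one TREE factor over all arcs with one BUDGET factor per head:

def build_parse_factor_graph(n, budget=5):
    """One binary variable per arc h -> m; a TREE factor over all arcs,
    plus a BUDGET(<=budget) factor over each head's outgoing arcs."""
    arc_id = {}
    for h in range(n):
        for m in range(n):
            if h != m:
                arc_id[h, m] = len(arc_id)
    factors = [("TREE", list(arc_id.values()))]          # coarse factor over all arcs
    for h in range(n):
        out_arcs = [arc_id[h, m] for m in range(n) if m != h]
        factors.append(("BUDGET", out_arcs, budget))     # valency constraint for head h
    return arc_id, factors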

Results. Figure 3 confirms that, unsurprisingly, the baseline with access to gold dependency structure quickly learns to predict perfectly, while the simple left-to-right baseline cannot progress. LP-SparseMAP with BUDGET constraints on the modifiers outperforms SparseMAP by over 10 percentage points (Table 2).

5.2 Natural language inference with decomposable structured attention

We now turn to the task of natural language inference, using LP-SparseMAP to uncover hidden alignments for structured attention networks. Natural language inference is a pairwise classification task: given a premise of length $m$ and a hypothesis of length $n$, the pair must be classified into one of three possible relationships: entailment, contradiction, or neutrality. We use the English-language SNLI and MultiNLI datasets (Bowman et al., 2015; Williams et al., 2017), with the same preprocessing and splits as Niculae et al. (2018).

Model architecture. We use the decomposable attention model of Parikh et al. (2016) with no intra-attention. The model computes a joint attention score matrix $\boldsymbol{S}$ of size $m \times n$, where $S_{ij}$ depends only on the $i$th word in the premise and the $j$th word in the hypothesis (hence decomposable). For each premise word $i$, we apply softmax over the $i$th row of $\boldsymbol{S}$ to get a weighted average of the hypothesis. Then, similarly, for each hypothesis word $j$, we apply softmax over the $j$th column of $\boldsymbol{S}$, yielding a representation of the premise. From then on, each word embedding is combined with its corresponding weighted context using an affine function, the results are sum-pooled and passed through an output multi-layer perceptron to make a classification. We propose replacing the independent softmax attention with structured, joint attention, normalizing over both rows and columns simultaneously in several different ways, using LP-SparseMAP with scores $\boldsymbol{S}$. We use frozen GloVe embeddings (Pennington et al., 2014), and all our models have 130k parameters (cf. Appendix E).

Factor graphs. Assume $m \leq n$. First, like Niculae et al. (2018), we consider a single matching factor, requiring every premise word to align to exactly one hypothesis word and every hypothesis word to be aligned to at most one premise word:

$$\sum_{j=1}^{n} \mu_{ij} = 1 \ \ \forall i \in [m], \qquad \sum_{i=1}^{m} \mu_{ij} \leq 1 \ \ \forall j \in [n]. \tag{20}$$
                 SNLI              MultiNLI
                 valid    test     valid    test
softmax          84.44    84.62    70.06    69.42
matching         84.57    84.16    70.84    70.36
LP-matching      84.70    85.04    70.57    70.64
LP-sequential    83.96    83.67    71.10    71.17
Table 3: NLI accuracy scores with structured attention. The LP-SparseMAP models perform competitively.

When $m = n$, linear maximization over this constraint set corresponds to the linear assignment problem, solved by the Kuhn-Munkres (Kuhn, 1955) and Jonker-Volgenant (Jonker and Volgenant, 1987) algorithms, and the feasible set is the set of doubly stochastic matrices. When $m < n$, the scores can be padded with $-\infty$ to a square matrix prior to invoking the algorithm. A linear maximization thus takes $O(n^3)$ time, and this instantiation of structured matching attention can be tackled by SparseMAP. Next, we consider a relaxed, equivalent formulation, which we call LP-matching, as shown in Figure 2, with one XOR factor per row and one AtMostOne factor per column:

$$\text{XOR}(\mu_{i1}, \dots, \mu_{in}) \ \ \forall i \in [m], \qquad \text{AtMostOne}(\mu_{1j}, \dots, \mu_{mj}) \ \ \forall j \in [n]. \tag{21}$$

Each subproblem can be solved in time nearly linear in its arity, for a much lower total complexity per iteration (cf. Appendix D). While more iterations may be necessary to converge, the finer-grained approach might make faster progress, yielding more useful latent alignments. Finally, we consider a more expressive joint alignment that encourages continuity. Inspired by the sequential alignment of Niculae et al. (2018), we propose a bi-directional model, called LP-sequential, consisting of a SEQUENCE factor over the premise, with a possible state for each aligned word in the hypothesis and a single transition score for every pair of alignments. By itself, this factor may align multiple premise words to the same hypothesis word; Niculae et al. (2018) circumvent this by running the optimization in both directions independently. Instead, we propose adding AtMostOne factors, as in Equation 21, ensuring each hypothesis word is aligned, on average, to at most one premise word. Effectively, this is like a sequence tagger allowed to use each of the states at most once. For both LP-SparseMAP approaches, we rescale the result by the row sums to ensure feasibility.
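As an aside, the MAP oracle of the coarse matching factor is readily available off the shelf; the sketch below (ours) pads the rectangular score matrix and calls SciPy's linear assignment solver.

import numpy as np
from scipy.optimize import linear_sum_assignment

def matching_map(scores):
    """scores: (m, n) alignment scores with m <= n; returns a 0/1 matrix
    with exactly one 1 per row and at most one per column."""
    m, n = scores.shape
    padded = np.full((n, n), -1e9)         # pad to square with very low scores
    padded[:m, :] = scores
    rows, cols = linear_sum_assignment(padded, maximize=True)
    y = np.zeros((m, n))
    for r, c in zip(rows, cols):
        if r < m:                          # drop assignments of padded rows
            y[r, c] = 1.0
    return y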

Figure 4: Attention induced using softmax (left) and LP-SparseMAP sequential (right) on a MultiNLI example. With this inductive bias, LP-SparseMAP learns a bi-directional alignment anchoring longer phrases.

Results. Table 3 reveals that LP-matching is the best performing mechanism on SNLI, and LP-sequential on MultiNLI. The transition score learned by LP-sequential is 1.6 on SNLI and 2.5 on MultiNLI, and Figure 4 shows an example of the useful inductive bias it learns. On both datasets, the relaxed LP-matching outperforms the coarse matching factor, suggesting that, indeed, equivalent parametrizations of a model may perform differently when not run until convergence.

5.3 Multilabel classification

Finally, to confirm that LP-SparseMAP is also suitable in the supervised setting, we evaluate it on the task of multilabel classification. Our factor graph has one binary variable for each of the $k$ labels, and a pairwise factor for every pair of labels, with a score for every label co-occurrence:

$$\operatorname{score}(\boldsymbol{\mu}) = \sum_{i=1}^{k} \eta_i \mu_i + \sum_{i < j} \eta_{ij}\, \mu_i \mu_j. \tag{22}$$
                         bibtex    bookmarks
Unstructured             42.28     35.76
Structured hinge loss    37.70     33.26
LP-SparseMAP loss        43.43     36.07
Table 4: Multilabel classification test scores.

Neural network parametrization. We use a 2-layer multi-layer perceptron to compute the score for each variable. In the structured models, we have one additional parameter for the co-occurrence score of every pair of classes. We compare an unstructured baseline (using the binary logistic loss for each label), a structured hinge loss (with LP-MAP inference), and an LP-SparseMAP loss model. We solve LP-MAP using AD3 and LP-SparseMAP with our proposed algorithm (cf. Appendix E).
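Concretely (an illustration with our own names), the factor graph of Equation 22 pairs one binary variable per label with one pairwise factor per label pair, and the MAP oracle of a pairwise factor is a four-way enumeration:

import numpy as np

def pairwise_map(eta_i, eta_j, eta_ij):
    """Best of the four assignments of two binary variables, with an
    interaction score eta_ij on the (1, 1) configuration."""
    configs = [(0, 0), (1, 0), (0, 1), (1, 1)]
    scores = [0.0, eta_i, eta_j, eta_i + eta_j + eta_ij]
    return configs[int(np.argmax(scores))]

def multilabel_factors(num_labels):
    """One pairwise factor per pair of labels (fully connected)."""
    return [(i, j) for i in range(num_labels) for j in range(i + 1, num_labels)]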

Results. Table 4 shows the example score on the test set for the bibtex and bookmarks benchmark datasets (Katakis et al., 2008). The structured hinge loss model is worse than the unstructured (binary logistic loss) baseline; the LP-SparseMAP loss model outperforms both. This suggests that the LP-SparseMAP loss is promising for structured output learning. We note that, in the strictly-supervised setting, approaches that blend inference with learning, such as those of Chen et al. (2015) and Tang et al. (2016), may be more efficient; however, LP-SparseMAP can work out of the box as a hidden layer as well.

6 Related work

Differentiable optimization. The most closely related research direction involves bi-level optimization, or argmin differentiation (Gould et al., 2016). Typically, such research assumes problems expressible in a standard form, for instance quadratic programs (Amos and Kolter, 2017) or disciplined convex programs, based on a conic reformulation (Agrawal et al., 2019a, b). Such approaches are not applicable to the typical optimization problems arising in structured prediction, because of the intractably large number of constraints typically necessary, and the difficulty of formulating many problems in standard forms. Our method instead interacts with the problem only through local oracle algorithms, exploiting the structure of the factor graph and allowing for more efficient handling of coarse factors (e.g., TREE) and logic constraints.

Latent structure models. Our motivation and applications mostly focus on learning with latent structure. Specifically, we are interested in global optimization methods, which require marginal inference or similar relaxations (Kim et al., 2017; Liu and Lapata, 2018; Corro and Titov, 2019a, b; Niculae et al., 2018), rather than incremental methods based on policy gradients (Yogatama et al., 2017). Promising methods exist for approximate marginal inference in factor graphs using MAP calls (Belanger et al., 2013; Krishnan et al., 2015; Tang et al., 2016), relying on entropy approximation penalties. Such approaches focus on supervised structured prediction, which is not our main goal, and, to our knowledge, their backward passes have not been studied. Importantly, as these penalties are non-quadratic, the active set algorithm does not apply, and one must fall back to more general variants of Frank-Wolfe; the active set method is a key ingredient of our work, as it exhibits fast finite convergence, sparse solutions and -- crucially -- precomputation of the matrix inverse required in the backward pass (Niculae et al., 2018). Instead, the quadratic penalty pioneered by Niculae et al. (2018) is more amenable to optimization, as well as bringing other sparsity benefits. It may be tempting to directly apply SparseMAP with an approximate LP-MAP oracle. The projection step of Peng et al. (2018) can be cast as a SparseMAP problem; thus, our algorithm can be used to extend their method to arbitrary factor graphs. For pairwise MRFs (a class of factor graphs), differentiating belief propagation, either through unrolling or perturbation-based approximation, has been studied (Stoyanov et al., 2011; Domke, 2013). Our approach instead computes implicit gradients, which is more efficient thanks to quantities precomputed in the forward pass, and which in some circumstances has been shown to work better (Rajeswaran et al., 2019). Finally, none of these approaches can inherently handle logic constraints or coarse factors.

7 Conclusions

We introduced LP-SparseMAP, an extension of SparseMAP to sparse differentiable optimization in any factor graph, enabling neural hidden layers with arbitrarily complex structure, specified using a familiar domain-specific language. We have shown LP-SparseMAP to outperform SparseMAP for latent structure learning, and its corresponding loss function to outperform the structured hinge for structured output learning. We hope that our toolkit empowers future research on latent structure models, improving efficiency for smaller networks through inductive bias.

Supplementary Material

Appendix A Separable reformulation of LP-SparseMAP

Lemma 1.

Let , , , defined as in Proposition 1. Let . Then,

  1. ;

  2. ;

  3. For any feasible pair , , and

Proof.

(i) The matrix , which expresses the agreement constraint , is a stack of selector matrices, in other words, its sub-blocks are either the identity or the zero matrix . We index its rows by pairs , and its columns by . Denote by the fact that the th variable under factor is . Then, . We can then explicitly compute

If , , so .

(ii) By construction, for the unique variable with . Thus,

(iii) It follows from (i) and (ii) that .

(iv) Since is full-rank, the feasibility condition is equivalent to . Left-multiplying by yields . Moreover,

Appendix B Derivation of updates and comparison to LP-MAP

Recall the problem we are trying to minimize, from Equation 11:

(23)

Since the simplex constraints are separable, we may move them to the objective, yielding

(24)

The $\gamma$-augmented Lagrangian of problem (24) is

(25)

The solution is a saddle point of the Lagrangian, i.e., a solution of

(26)

ADMM optimizes Equation 26 in a block-coordinate fashion; we next derive each block update.

B.1 Updating the local variables

We update the local variables of each factor $f$ independently by solving:

(27)

Denoting , we have that

The $\gamma$-augmented term regularizing the subproblems toward the current estimate of the global solution is

For each factor, the subproblem objective is therefore:

(28)

This is exactly a SparseMAP instance with rescaled scores and degree corrections.

Observation. For comparison, when solving LP-MAP with AD3, the subproblems minimize the objective

(29)

so the update is again a SparseMAP instance with suitably rescaled scores. Notable differences are the scaling of the quadratic term (corresponding to the added regularization) and the diagonal degree reweighting.

B.2 Updating the global variables

We must solve

(30)

This is an unconstrained problem. Setting the gradient of the objective to zero, we get

(31)

with the unique solution