LPSparseMAP: Differentiable Relaxed Optimization for Sparse Structured Prediction
Abstract
Structured prediction requires manipulating a large number of combinatorial structures, e.g., dependency trees or alignments, either as latent or output variables. Recently, the SparseMAP method has been proposed as a differentiable, sparse alternative to maximum a posteriori (MAP) and marginal inference. SparseMAP returns a combination of a small number of structures, a desirable property in some downstream applications. However, SparseMAP requires a tractable MAP inference oracle. This excludes, e.g., loopy graphical models or factor graphs with logic constraints, which generally require approximate inference. In this paper, we introduce LPSparseMAP, an extension of SparseMAP that addresses this limitation via a local polytope relaxation. LPSparseMAP uses the flexible and powerful domain-specific language of factor graphs for defining and backpropagating through arbitrary hidden structure, supporting coarse decompositions, hard logic constraints, and higher-order correlations. We derive the forward and backward algorithms needed for using LPSparseMAP as a hidden or output layer. Experiments in three structured prediction tasks show benefits compared to SparseMAP and Structured SVM.
1 Introduction
The data processed by machine learning systems often has underlying structure: for instance, language data has inter-word dependency trees or alignments, while image data can have meaningful segmentations. As downstream models benefit from the hidden structure, practitioners typically resort to pipelines, training a structure predictor on labelled data and using its output as features. This approach requires annotation, suffers from error propagation, and does not allow the structure predictor to adapt to the downstream task.
Instead, a promising direction is to treat structure as latent, or hidden: learning a structure predictor without supervision, together with the downstream model, in an end-to-end fashion. Several recent approaches have been proposed to tackle this, based on differentiating through marginal inference (Kim et al., 2017; Liu and Lapata, 2018), on noisy gradient estimates (Peng et al., 2018; Yogatama et al., 2017), or on both (Corro and Titov, 2019a, b). These approaches require specialized, structure-specific algorithms either for computing gradients or for sampling, limiting the practitioner to a catalogue of supported structure types. A slightly more general approach is SparseMAP (Niculae et al., 2018), which is differentiable and outputs combinations of a small number of structures, requiring only an algorithm for MAP. However, it is often desirable to increase the expressiveness of structured models with logic constraints or higher-order interactions. This complicates the search space and typically makes exact maximization intractable. For example, adding constraints on the depth of a parse tree typically makes the problem NP-hard. We relax these stringent limitations and improve practitioners' modeling freedom through the following contributions:

We propose a generic method for differentiable structured hidden layers, based on the flexible domain-specific language of factor graphs, familiar to many structured prediction practitioners.

We derive an efficient and globally-convergent ADMM algorithm for the forward pass.

We prove a compact, efficient form for the backward pass, reusing quantities precomputed in the forward pass and avoiding the need to unroll a computation graph.

Our overall method is modular: new factor types can be added to our toolkit just by providing a MAP oracle or, if available, specialized SparseMAP forward and backward functions.

We derive the specialized computation described above for core building-block factors such as pairwise, logical OR, negation, and budget constraints, ensuring our toolkit is expressive out of the box.
We show empirical improvements on inducing latent trees on arithmetic expressions, bidirectional alignments in natural language inference, and multilabel classification. Our library is available at https://github.com/deepspin/lpsparsemap.
2 Background
2.1 Notation
We denote scalars, vectors, and matrices as $a$, $\mathbf{a}$, and $\mathbf{A}$, respectively. The set of indices $\{1, \dots, n\}$ is denoted $[n]$. The $i$-th column of a matrix $\mathbf{A}$ is $\mathbf{a}_i$. The canonical simplex is $\triangle^n := \{\mathbf{p} \in \mathbb{R}^n : \mathbf{p} \geq \mathbf{0},\; \mathbf{1}^\top \mathbf{p} = 1\}$, and the convex hull is $\mathrm{conv}\{\mathbf{a}_1, \dots, \mathbf{a}_n\} := \{\mathbf{A}\mathbf{p} : \mathbf{p} \in \triangle^n\}$. We denote the row-wise stacking of $\mathbf{a}_1, \dots, \mathbf{a}_n$ as $[\mathbf{a}_1; \dots; \mathbf{a}_n]$. In particular, $[\mathbf{u}; \mathbf{v}]$ is the concatenation of two (column) vectors. Given a vector $\mathbf{v}$, $\mathrm{diag}(\mathbf{v})$ is the diagonal matrix with $\mathbf{v}$ along the diagonal. Given matrices $\mathbf{A}_1, \dots, \mathbf{A}_n$ of arbitrary dimensions, $\mathrm{diag}(\mathbf{A}_1, \dots, \mathbf{A}_n)$ denotes the block-diagonal matrix with the $\mathbf{A}_i$ as blocks along the diagonal.
2.2 Tractable structured problems
Structured prediction involves searching for valid structures over a large, combinatorial space $\mathcal{Y}$. We assign a vector representation $\mathbf{a}_y \in \{0,1\}^D$ to each structure $y \in \mathcal{Y}$. For instance, we may consider structures to be joint assignments of $D$ binary variables (corresponding to parts of the structure) and define $[\mathbf{a}_y]_i = 1$ if variable $i$ is turned on in structure $y$, and $0$ otherwise. The set of valid structures is typically nontrivial. For example, in matching problems between $n$ workers and $n$ tasks, we have $n^2$ binary variables, but the only legal assignments give exactly one task to each worker, and one worker to each task.
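The vector encoding of structures can be illustrated with a small sketch (toy numbers, not from the paper): each structure is a binary indicator column, and any convex combination of these columns lies in the marginal polytope defined next.

```python
import numpy as np

# Toy illustration: three structures over D = 4 binary parts, encoded as
# indicator columns a_y of a matrix A, as described above.
A = np.array([
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
    [0, 0, 1],
])  # column y is a_y in {0,1}^4

# A point of conv{a_y} is an expectation A @ p under a distribution p over
# structures; entry i is the probability that part i is "on".
p = np.array([0.5, 0.25, 0.25])
mu = A @ p
print(mu)  # -> [0.75 0.5  0.5  0.25]
```

This is only a didactic encoding; real structures (trees, matchings) have exponentially many columns, which is why the algorithms below never materialize $\mathbf{A}$.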
Maximization (MAP). Given a score vector $\boldsymbol{\eta} \in \mathbb{R}^D$ over parts, we assign a score $\boldsymbol{\eta}^\top \mathbf{a}_y$ to each structure. Assembling all $\mathbf{a}_y$ as columns of a matrix $\mathbf{A}$, the highest-scoring structure is the one maximizing

(1)  $\max_{y \in \mathcal{Y}} \boldsymbol{\eta}^\top \mathbf{a}_y = \max_{\boldsymbol{\mu} \in \mathcal{M}} \boldsymbol{\eta}^\top \boldsymbol{\mu}.$

$\mathcal{M} := \mathrm{conv}\{\mathbf{a}_y : y \in \mathcal{Y}\}$ is called the marginal polytope (Wainwright and Jordan, 2008), and its points are expectations $\mathbb{E}_{y \sim \mathbf{p}}[\mathbf{a}_y]$ under distributions $\mathbf{p} \in \triangle^{|\mathcal{Y}|}$.
In the sequel, we split $\mathbf{a}_y = [\mathbf{u}_y; \mathbf{v}_y]$, such that $\mathbf{u}_y$ is the output of interest (e.g., variable assignments, sometimes called unaries), while $\mathbf{v}_y$ captures additional structures or interactions (e.g., transitions in sequence tagging). This distinction is not essential, as we may always take $\mathbf{u}_y := \mathbf{a}_y$ (i.e., treat additional interactions as first-class variables), but it is more consistent with pairwise Markov Random Fields (MRFs).
Optimization as a hidden layer. Consider viewing MAP as a function, breaking ties arbitrarily:

(2)  $\hat{\boldsymbol{\mu}}_{\mathrm{MAP}}(\boldsymbol{\eta}) := \arg\max_{\boldsymbol{\mu} \in \mathcal{M}} \boldsymbol{\eta}^\top \boldsymbol{\mu}.$

Almost everywhere, small changes to $\boldsymbol{\eta}$ do not change the highest-scoring structure. Thus, for any locally-continuous slice of $\hat{\boldsymbol{\mu}}_{\mathrm{MAP}}$, $\partial \hat{\boldsymbol{\mu}}_{\mathrm{MAP}} / \partial \boldsymbol{\eta} = \mathbf{0}$, making it unsuitable as a hidden layer in a neural network trained with gradient-based optimization (Peng et al., 2018).
Marginal inference. For unstructured maximization (as seen, for instance, in attention mechanisms), it is common to replace $\arg\max$ with its relaxation softmax. Denote the Shannon entropy of a distribution $\mathbf{p}$ by $H(\mathbf{p}) := -\sum_i p_i \log p_i$. The structured relaxation of MAP, analogous to softmax, is the entropy-regularized problem

(3)  $\hat{\mathbf{p}}(\boldsymbol{\eta}) := \arg\max_{\mathbf{p} \in \triangle^{|\mathcal{Y}|}} \boldsymbol{\eta}^\top \mathbf{A} \mathbf{p} + H(\mathbf{p}),$

whose solution is $\hat{p}_y \propto \exp(\boldsymbol{\eta}^\top \mathbf{a}_y)$. This Gibbs distribution is dense and induces a marginal distribution over variable assignments (Wainwright and Jordan, 2008):

(4)  $\hat{\boldsymbol{\mu}}_{\mathrm{marg}}(\boldsymbol{\eta}) := \mathbf{A}\,\hat{\mathbf{p}}(\boldsymbol{\eta}) = \mathbb{E}_{y \sim \hat{\mathbf{p}}}[\mathbf{a}_y].$
While generally intractable, for certain models, such as sequence tagging, $\hat{\boldsymbol{\mu}}_{\mathrm{marg}}$ and its gradients can be computed efficiently (often with dynamic programming; Kim et al., 2017). In many other cases, marginal inference is intractable, e.g., matching (Valiant, 1979; Taskar, 2004, Section 3.5) and dependency parsing with valency constraints (McDonald and Satta, 2007).
SparseMAP (Niculae et al., 2018) is a differentiable middle ground between maximization and expectation. It is defined via the quadratic objective

(5)  $\hat{\boldsymbol{\mu}}(\boldsymbol{\eta}) := \arg\max_{\boldsymbol{\mu} = [\mathbf{u}; \mathbf{v}] \in \mathcal{M}} \boldsymbol{\eta}^\top \boldsymbol{\mu} - \tfrac{1}{2}\|\mathbf{u}\|^2,$

where an optimal sparse distribution $\hat{\mathbf{p}}$ and the unique $\hat{\boldsymbol{\mu}}$ can be efficiently computed via the active set method (Nocedal and Wright, 1999, Ch. 16.4 & 16.5), a generalization of Wolfe's min-norm point method (Wolfe, 1976) and an instance of conditional gradient (Frank and Wolfe, 1956). Remarkably, the active set method only requires calls to a maximization oracle (i.e., repeatedly finding the highest-scoring structure, after score adjustments), and has linear, finite convergence. This means SparseMAP can be computed efficiently even for structures where marginal inference is unavailable, potentially turning any structured problem with an available maximization algorithm into a differentiable sparse structured hidden layer. The sparsity not only brings computational advantages, but also aids visualization and interpretation.
However, the requirement of an exact maximization algorithm is still a rather stringent limitation. In the remainder of the section, we look into a flexible family of structured models where maximization is hard. Then, we extend SparseMAP to cover all such models.
2.3 Intractable structured problems and factor graph representations
We now turn to more complicated structured problems, consisting of multiple interacting subproblems. As we shall see, this covers many interesting problems.
Essentially, we represent the global structure as assignments to variables, and posit a decomposition of the problem into local factors $f \in \mathcal{F}$, each encoding locally-tractable scoring and constraints (Kschischang et al., 2001). A factor may be seen as a smaller structured subproblem. Crucially, wherever multiple factors overlap on a variable, they must agree, rendering the subproblems interdependent and non-separable.
Examples. Figure 1 shows a factor graph for a dependency parsing problem in which prior knowledge dictates valency constraints, i.e., disallowing words from being assigned more than a fixed number of dependent modifiers. This encourages depth, preventing trees from being too flat. For a sentence with $n$ words, we use one binary variable for every possible arc (including the root arcs, omitted in the figure). The global tree factor disallows assignments that are not trees, and the budget constraint factors, each governing a different set of variables, disallow more than a fixed number of dependency arcs out of each word. Factor graph representations are often not unique. For instance, consider a matching (linear assignment) model (Figure 2). We may employ a coarse factorization consisting of a single matching factor, for which maximization is tractable thanks to the Kuhn-Munkres algorithm (Kuhn, 1955). This problem can also be represented using multiple XOR factors, constraining each row and each column to have exactly (exclusively) one selected variable.
To be formal, denote the variable assignments as $\mathbf{u} \in \{0,1\}^D$. For each factor $f$, we encode its legal assignments as columns of a matrix $\mathbf{A}_f$, and define a selector matrix $\mathbf{C}_f$ such that $\mathbf{C}_f \mathbf{u}$ ``selects'' the part of $\mathbf{u}$ covered by the factor $f$. Then, a valid global assignment can be represented as a tuple of local assignments $(y_f)_{f \in \mathcal{F}}$, with $\mathbf{a}_{y_f} = [\mathbf{u}_{y_f}; \mathbf{v}_{y_f}]$ a column of $\mathbf{A}_f$, provided that the agreement constraints are satisfied:

(6)  $\mathbf{C}_f \mathbf{u} = \mathbf{u}_{y_f} \quad \text{for all } f \in \mathcal{F}.$

Finding the highest-scoring structure has the same form as in the tractable case, but the discrete agreement constraints make it difficult to compute, even when each factor is simple:

(7)  $\max_{\mathbf{u} \in \{0,1\}^D,\, (y_f)} \;\sum_{f \in \mathcal{F}} \boldsymbol{\eta}_f^\top \mathbf{a}_{y_f} \quad \text{subject to } \mathbf{C}_f \mathbf{u} = \mathbf{u}_{y_f} \;\;\text{for all } f \in \mathcal{F}.$
For compactness, consider the concatenations $\boldsymbol{\mu} := [\boldsymbol{\mu}_{f_1}; \dots; \boldsymbol{\mu}_{f_{|\mathcal{F}|}}]$, $\boldsymbol{\eta} := [\boldsymbol{\eta}_{f_1}; \dots; \boldsymbol{\eta}_{f_{|\mathcal{F}|}}]$, and $\mathbf{C} := [\mathbf{C}_{f_1}; \dots; \mathbf{C}_{f_{|\mathcal{F}|}}]$, and the block-diagonal matrix $\mathbf{A} := \mathrm{diag}(\mathbf{A}_{f_1}, \dots, \mathbf{A}_{f_{|\mathcal{F}|}})$. We may then write the optimization problem

(8)  $\max_{\mathbf{u}, \boldsymbol{\mu}} \;\boldsymbol{\eta}^\top \boldsymbol{\mu} \quad \text{subject to } \boldsymbol{\mu}_f \in \mathcal{M}_f,\;\; \mathbf{C}_f \mathbf{u} = \mathbf{u}_f \;\;\text{for all } f \in \mathcal{F},$

continuously relaxing each factor independently while enforcing agreement, where $\mathcal{M}_f$ is the convex hull of the columns of $\mathbf{A}_f$ and $\boldsymbol{\mu}_f = [\mathbf{u}_f; \mathbf{v}_f]$. The objective in Equation 8 is separable, but the constraints are not. The feasible set,

(9)  $\mathcal{L} := \big\{\mathbf{u} \in [0,1]^D : \text{for each } f \in \mathcal{F} \text{ there exists } \boldsymbol{\mu}_f \in \mathcal{M}_f \text{ with } \mathbf{C}_f \mathbf{u} = \mathbf{u}_f\big\},$

is called the local polytope and satisfies $\mathcal{L} \supseteq \mathcal{M}$. Therefore, (8) is a relaxation of (7), known as LPMAP (Wainwright and Jordan, 2008). In general, the inclusion is strict. Many LPMAP algorithms exploiting the graphical model structure have been proposed, from the perspective of message passing or dual decomposition (Wainwright et al., 2005; Kolmogorov, 2006; Komodakis et al., 2007; Globerson and Jaakkola, 2007; Koo et al., 2010). In particular, AD^{3} (Martins et al., 2015) tackles LPMAP by solving a SparseMAP-like quadratic subproblem for each factor. In the next section, we use this connection to extend AD^{3} to a smoothed objective, resulting in a general algorithm for sparse differentiable inference.
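The strictness of the inclusion can be seen on a tiny example (an assumed setup, not from the paper): put an XOR factor ($u_i + u_j = 1$) on each pair of three binary variables. No integral assignment can 2-color an odd cycle, yet the local polytope contains a fractional point, which a generic LP solver finds.

```python
import numpy as np
from scipy.optimize import linprog

# Three variables, XOR factors on pairs (1,2), (2,3), (1,3): u_i + u_j = 1.
A_eq = np.array([[1, 1, 0],
                 [0, 1, 1],
                 [1, 0, 1]])
b_eq = np.ones(3)

# LP-MAP relaxation: maximize u1 + u2 + u3 over the local polytope.
res = linprog(c=-np.ones(3),           # linprog minimizes, so negate
              A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * 3,
              method="highs")
print(res.x)  # -> [0.5 0.5 0.5], a fractional vertex of L, outside M
```

The fractional solution $(\tfrac12, \tfrac12, \tfrac12)$ is feasible for $\mathcal{L}$ even though the discrete problem has no feasible assignment at all, a stark instance of $\mathcal{M} \subsetneq \mathcal{L}$.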
3 LPSparseMAP
By analogy to Equation 5, we propose the differentiable LPSparseMAP inference strategy:

(10)  $\mathrm{LPSparseMAP}(\boldsymbol{\eta}) := \arg\max_{\mathbf{u}, \boldsymbol{\mu}} \;\boldsymbol{\eta}^\top \boldsymbol{\mu} - \tfrac{1}{2}\|\mathbf{u}\|^2 \quad \text{subject to } \boldsymbol{\mu}_f \in \mathcal{M}_f,\;\; \mathbf{C}_f \mathbf{u} = \mathbf{u}_f \;\;\text{for all } f \in \mathcal{F}.$
Unlike LPMAP (Equation 8), LPSparseMAP has a non-separable term in the objective. Separating it requires nontrivial accounting for variables appearing in multiple subproblems. We tackle this in the next proposition, reformulating Equation 10 as consensus optimization.
Proposition 1.
Denote by $\delta_i$ the number of factors governing variable $u_i$, and let $\boldsymbol{\Delta}_f := \mathrm{diag}(\mathbf{C}_f \boldsymbol{\delta})$. Then, Equation 10 is equivalent to

(11)  $\max_{\mathbf{u}, \boldsymbol{\mu}} \;\sum_{f \in \mathcal{F}} \Big( \boldsymbol{\eta}_f^\top \boldsymbol{\mu}_f - \tfrac{1}{2}\big\|\boldsymbol{\Delta}_f^{-1/2}\mathbf{u}_f\big\|^2 \Big)$

subject to  $\boldsymbol{\mu}_f \in \mathcal{M}_f$ and $\mathbf{u}_f = \mathbf{C}_f \mathbf{u}$ for all $f \in \mathcal{F}$.

Proof.
The constraints $\{\mathbf{u}_f = \mathbf{C}_f \mathbf{u}\}_{f \in \mathcal{F}}$ and $\mathbf{C}\mathbf{u} = \tilde{\mathbf{u}}$, where $\tilde{\mathbf{u}} := [\mathbf{u}_{f_1}; \dots; \mathbf{u}_{f_{|\mathcal{F}|}}]$, are equivalent, since $\mathbf{C}^\top\mathbf{C} = \mathrm{diag}(\boldsymbol{\delta})$ ensures $\mathbf{C}^\top\mathbf{C}$ is invertible. It remains to show that, at feasibility, $\sum_f \|\boldsymbol{\Delta}_f^{-1/2}\mathbf{u}_f\|^2 = \|\mathbf{u}\|^2$. This follows from $\sum_f \mathbf{C}_f^\top \boldsymbol{\Delta}_f^{-1} \mathbf{C}_f = \mathbf{I}$ (shown in Appendix A). ∎
3.1 Forward pass
Using this reformulation, we are now ready to introduce an ADMM algorithm (Glowinski and Marroco, 1975; Gabay and Mercier, 1976; Boyd et al., 2011) for maximizing Equation 11. The algorithm is given in Algorithm 1 and derived in Appendix B. Like AD^{3}, it iterates, alternating between:

solving a SparseMAP subproblem for each factor (with the active set algorithm, this requires only cheap calls to a MAP oracle);

enforcing global agreement by averaging;

performing a gradient update on the dual variables.
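The three steps above can be sketched on a minimal consensus problem (an assumed toy setup, not the paper's implementation): two XOR factors sharing one variable, where each local subproblem is approximated by a plain simplex projection (sparsemax), omitting the exact degree correction for clarity.

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection onto the simplex."""
    z_sorted = np.sort(z)[::-1]
    cssv = np.cumsum(z_sorted) - 1.0
    k = np.arange(1, len(z) + 1)
    support = z_sorted - cssv / k > 0
    tau = cssv[support][-1] / k[support][-1]
    return np.maximum(z - tau, 0.0)

# Factor 1 covers variables {1, 2}; factor 2 covers {2, 3} (selectors C_f).
C = [np.array([[1., 0., 0.], [0., 1., 0.]]),
     np.array([[0., 1., 0.], [0., 0., 1.]])]
deg = np.array([1.0, 2.0, 1.0])       # variable degrees
eta = np.array([1.0, 3.0, 0.5])       # toy scores, shared vars split by degree
rho = 1.0
u = np.full(3, 0.5)
lam = [np.zeros(2), np.zeros(2)]

for _ in range(300):
    copies = []
    for f in range(2):
        # 1) local SparseMAP-like subproblem (simplified: sparsemax step)
        scores = C[f] @ (eta / deg) - lam[f] + rho * (C[f] @ u)
        copies.append(sparsemax(scores / (1.0 + rho)))
    # 2) global agreement by degree-weighted averaging of the copies
    u = sum(C[f].T @ copies[f] for f in range(2)) / deg
    # 3) gradient update on the dual variables
    for f in range(2):
        lam[f] = lam[f] + rho * (copies[f] - C[f] @ u)
```

At convergence the copies agree, so the consensus point satisfies both XOR constraints ($u_1 + u_2 = 1$ and $u_2 + u_3 = 1$) simultaneously.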
Proposition 2.
Algorithm 1 converges to a solution of Equation 10.

Proof.
The algorithm is an instantiation of ADMM applied to Equation 11, inheriting the convergence guarantees of ADMM (Boyd et al., 2011, Appendix A). From Proposition 1, this problem is equivalent to (10). Finally, the rate of convergence is established by Martins et al. (2015, Proposition 8), as the problems differ only through an additional regularization term in the objective. ∎
When there is a single factor, i.e., $|\mathcal{F}| = 1$, the agreement constraints are vacuous and a single outer iteration achieves convergence. In this case, since every degree $\delta_i = 1$ and thus $\boldsymbol{\Delta}_f = \mathbf{I}$, we recover SparseMAP exactly.
3.2 Backward pass
Unlike marginal inference, LPSparseMAP encourages the local distribution at each factor to become sparse. This results in a simple form for the LPSparseMAP Jacobian, defined in terms of the local SparseMAP Jacobians of each factor (Appendix C.1). Denote the local solutions $(\hat{\mathbf{u}}_f, \hat{\mathbf{v}}_f)$ and the Jacobians of the SparseMAP subproblem for each factor as

(12)  $\mathbf{J}_{u,f} := \frac{\partial \hat{\mathbf{u}}_f}{\partial \boldsymbol{\eta}_{f,u}}, \qquad \mathbf{J}_{v,f} := \frac{\partial \hat{\mathbf{v}}_f}{\partial \boldsymbol{\eta}_{f,u}}.$

When using the active set algorithm for SparseMAP, these Jacobians are precomputed in the forward pass (Niculae et al., 2018). The LPSparseMAP backward pass combines the local Jacobians while taking into account the agreement constraints, as shown next.
Proposition 3.
Let $\bar{\mathbf{J}}_u := \mathrm{diag}(\mathbf{J}_{u,f_1}, \dots, \mathbf{J}_{u,f_{|\mathcal{F}|}})$ and $\bar{\mathbf{J}}_v := \mathrm{diag}(\mathbf{J}_{v,f_1}, \dots, \mathbf{J}_{v,f_{|\mathcal{F}|}})$ denote the block-diagonal matrices of local SparseMAP Jacobians. Consider the fixed point

(13)  $\mathbf{J} = \big(\boldsymbol{\Delta}^{-1}\mathbf{C}^\top \bar{\mathbf{J}}_u \mathbf{C}\big)\,\mathbf{J}.$

Then, the LPSparseMAP Jacobians are given by

(14)  $\frac{\partial \hat{\mathbf{u}}}{\partial \boldsymbol{\eta}_u} = \mathbf{J}\,\boldsymbol{\Delta}^{-1}\mathbf{C}^\top \bar{\mathbf{J}}_u, \qquad \frac{\partial \hat{\mathbf{v}}}{\partial \boldsymbol{\eta}_u} = \bar{\mathbf{J}}_v\,\mathbf{C}\,\frac{\partial \hat{\mathbf{u}}}{\partial \boldsymbol{\eta}_u}.$
The proof is given in Appendix C.2; the fixed point may be computed using an eigensolver. However, to use LPSparseMAP as a hidden layer, we do not need materialized Jacobians, just access to Jacobian-vector products $\mathbf{z} \mapsto (\partial \hat{\boldsymbol{\mu}} / \partial \boldsymbol{\eta})^\top \mathbf{z}$.
These can be computed iteratively by Algorithm 2. Since $\mathbf{C}$ and $\mathbf{C}^\top$ are highly sparse and structured selector matrices, lines 5 and 7 are simple indexing operations followed by scaling; the bulk of the computation is line 6, which can be seen as invoking the backward pass of each factor, as if that factor were alone in the graph. The structure of Algorithm 2 is similar to that of Algorithm 1; however, our backward pass is much more efficient than ``unrolling'' Algorithm 1 within a computation graph: our algorithm only requires access to the final state of the ADMM solver (Algorithm 1), rather than all intermediate states, as would be required for unrolling.
3.3 Implementation and specializations
The forward and backward passes of LPSparseMAP, described above, are appealing from the perspective of modular implementation. The outer loop interacts with a factor through only two interfaces: a SolveSparseMAP function and a JacobianTimesVector function. In turn, both methods can be implemented in terms of a SolveMAP maximization oracle (Niculae et al., 2018).
For certain factors, such as the logic constraints in Table 1, faster direct implementations of SolveSparseMAP and JacobianTimesVector are available, and our algorithm easily allows specialization. This is appealing from a testing perspective, as the specializations must agree with the generic implementation.
For example, the exclusive-or factor XOR requires that exactly one out of $d$ variables be on. Its marginal polytope is the convex hull of the allowed assignments, $\mathcal{M}_{\mathrm{XOR}} = \mathrm{conv}\{\mathbf{e}_1, \dots, \mathbf{e}_d\} = \triangle^d$. The required SparseMAP subproblem with degree corrections is

(15)  $\hat{\mathbf{u}} = \arg\min_{\mathbf{u} \in \triangle^d} \;\tfrac{1}{2}\,\mathbf{u}^\top \boldsymbol{\Delta}_f^{-1} \mathbf{u} - \boldsymbol{\eta}^\top \mathbf{u}.$

When $\boldsymbol{\Delta}_f = \mathbf{I}$, this is a projection onto the simplex (sparsemax), and efficient algorithms are known for its forward and backward passes (Martins and Astudillo, 2016). For general diagonal $\boldsymbol{\Delta}_f$, the algorithm of Pardalos and Kovoor (1990) applies, and the backward pass involves a generalization of the sparsemax Jacobian.
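The unweighted special case of the subproblem above, together with its Jacobian-vector product, can be sketched as follows (a minimal reference implementation of sparsemax, not the toolkit's optimized code).

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the simplex (Martins & Astudillo, 2016)."""
    z_sorted = np.sort(z)[::-1]
    cssv = np.cumsum(z_sorted) - 1.0
    k = np.arange(1, len(z) + 1)
    support = z_sorted - cssv / k > 0
    tau = cssv[support][-1] / k[support][-1]   # threshold defining the support
    return np.maximum(z - tau, 0.0)

def sparsemax_jvp(p, v):
    """Jacobian-vector product at a solution p: on the support S,
    J = diag(s) - s s^T / |S|, with s the indicator of S."""
    s = (p > 0).astype(float)
    return s * v - s * (s @ v) / s.sum()

p = sparsemax(np.array([0.5, 0.1, -1.0]))
print(p)  # -> [0.7 0.3 0. ]: sparse, supported on two coordinates
```

Because sparsemax is piecewise linear, the JVP is exact wherever the support is stable, which is what makes the precomputed backward pass discussed above cheap.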
In Appendix D, we derive specialized forward and backward passes for XOR and the other constraint factors in Table 1, as well as for negated variables, OR, OROut, Knapsack, and pairwise (Ising) factors.
name               | constraint
-------------------|----------------------------------------
XOR (exactly one)  | $\sum_{i} u_i = 1$
AtMostOne          | $\sum_{i} u_i \leq 1$
OR                 | $\sum_{i} u_i \geq 1$
BUDGET             | $\sum_{i} u_i \leq B$
Knapsack           | $\sum_{i} c_i u_i \leq B$
OROut              | $u_d = u_1 \lor \dots \lor u_{d-1}$
4 LPSparseMAP loss for structured outputs
So far, we described LPSparseMAP for structured hidden layers. When supervision is available, either as a downstream objective or as partial supervision over latent structures, there is a natural convex loss relaxing the SparseMAP loss (Niculae et al., 2018):

(16)  $L(\boldsymbol{\eta}; \bar{\boldsymbol{\mu}}) := \max_{\mathbf{u}, \boldsymbol{\mu}} \Big( \boldsymbol{\eta}^\top \boldsymbol{\mu} - \tfrac{1}{2}\|\mathbf{u}\|^2 \Big) - \Big( \boldsymbol{\eta}^\top \bar{\boldsymbol{\mu}} - \tfrac{1}{2}\|\bar{\mathbf{u}}\|^2 \Big),$

under the constraints of Equation 10, where $\bar{\boldsymbol{\mu}}$ encodes the target structure. Like the SparseMAP loss, this LPSparseMAP loss falls into the recently-proposed class of Fenchel-Young losses (Blondel et al., 2019a); therefore it is a well-behaved loss and, moreover, it naturally has a margin property (Blondel et al., 2019b, Proposition 8). Its gradients are obtained from the LPSparseMAP solution as

(17)  $\nabla_{\boldsymbol{\eta}_{f,u}} L(\boldsymbol{\eta}; \bar{\boldsymbol{\mu}}) = \hat{\mathbf{u}}_f - \bar{\mathbf{u}}_f,$

(18)  $\nabla_{\boldsymbol{\eta}_{f,v}} L(\boldsymbol{\eta}; \bar{\boldsymbol{\mu}}) = \hat{\mathbf{v}}_f - \bar{\mathbf{v}}_f.$
When already using LPSparseMAP as a hidden layer, this loss provides a natural way to incorporate supervision on the latent structure at no additional cost.
5 Experiments
In this section, we demonstrate LPSparseMAP for learning complex latent structures on both toy and real-world datasets, as well as on a structured output task. Learning hidden structures solely from a downstream objective is challenging for powerful models that can bypass the latent component entirely. For this reason, we design our experiments using simpler, smaller networks where the inferred structure is an unbypassable bottleneck, ensuring that the predictions depend on it. We use DyNet (Neubig et al., 2017) and list hyperparameter configurations and ranges in Appendix E.
5.1 ListOps valency tagging
The ListOps dataset (Nangia and Bowman, 2018) is a synthetic collection of bracketed expressions, such as [max 2 9 [min 4 7 ] 0 ]. The arguments are lists of integers, and the operators are set summarizers such as median, max, sum, etc. It was proposed as a litmus test for studying latent tree learning models, since the syntax is essential to the semantics. Instead of tackling the challenging task of learning to evaluate the expressions, we follow Corro and Titov (2019b) and study a tagging task: labeling each operator with the number of arguments it governs.
Model architecture. We encode the sequence with a BiLSTM, yielding vectors $\mathbf{h}_1, \dots, \mathbf{h}_n$. We compute the score of each dependency arc $h \to m$ as the dot product between the outputs of two mappings, one encoding the head and one the modifier. We perform LPSparseMAP optimization to get the sparse arc posterior probabilities, using different factor graph structures $\mathcal{G}$, described in the next paragraph:

(19)  $\mathbf{u} := \mathrm{LPSparseMAP}_{\mathcal{G}}(\boldsymbol{\eta}).$

The arc posteriors $\mathbf{u}$ correspond to a sparse combination of dependency trees. We perform one iteration of a Graph Convolutional Network (GCN) along the edges in $\mathbf{u}$. Crucially, the input to the GCN is not the BiLSTM output but a ``delexicalized'' sequence $[\mathbf{e}, \dots, \mathbf{e}]$, where $\mathbf{e}$ is a learned parameter vector, repeated $n$ times regardless of the tokens. This forces the predictions to rely on the GCN, and thus on the latent trees, preventing the model from using the global BiLSTM to ``cheat''. The GCN produces contextualized representations, which we then pass through an output layer to predict the valency label for each operator node.
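The bilinear arc scorer described above can be sketched as follows; the shapes, random stand-ins for BiLSTM states, and the two projection matrices are illustrative assumptions, not the paper's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                       # toy sentence length and hidden size
H = rng.normal(size=(n, d))       # stand-ins for BiLSTM outputs h_1..h_n
W_head = rng.normal(size=(d, d))  # hypothetical head-encoder parameters
W_mod = rng.normal(size=(d, d))   # hypothetical modifier-encoder parameters

# eta[h, m]: score of the arc with head h and modifier m, as the dot
# product between the two learned views of the BiLSTM states.
eta = (H @ W_head) @ (H @ W_mod).T
```

The resulting $n \times n$ score matrix is exactly the input $\boldsymbol{\eta}$ that Equation 19 consumes.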
Factor graphs. Unlike Corro and Titov (2019b), who use projective dependency parsing, we consider the general non-projective case, making the problem more challenging. The MAP oracle is the maximum arborescence algorithm (Chu and Liu, 1965; Edmonds, 1967).
               |   validation   |     test
               |  UAS     Acc.  |  UAS     Acc.
left-to-right  |  28.14   17.54 |  28.07   17.43
tree           |  68.23   68.74 |  68.74   69.12
tree+budget    |  82.35   82.59 |  82.75   82.95
First, we consider a factor graph with a single non-projective TREE factor: in this case, LPSparseMAP reduces to a SparseMAP baseline. Motivated by multiple observations that SparseMAP and similar latent structure learning methods tend to learn trivial trees (Williams et al., 2018), we next consider overlaying constraints in the form of BUDGET factors on top of the TREE factor. For every possible head word, we include a BUDGET factor allowing at most five of its possible outgoing arcs to be selected.
Results. Figure 3 confirms that, unsurprisingly, the baseline with access to gold dependency structure quickly learns to predict perfectly, while the simple lefttoright baseline cannot progress. LPSparseMAP with BUDGET constraints on the modifiers outperforms SparseMAP by over 10 percentage points (Table 2).
5.2 Natural language inference with decomposable structured attention
We now turn to the task of natural language inference, using LPSparseMAP to uncover hidden alignments for structured attention networks. Natural language inference is a pairwise classification task: given a premise of length $m$ and a hypothesis of length $n$, the pair must be classified into one of three possible relationships: entailment, contradiction, or neutrality. We use the English-language SNLI and MultiNLI datasets (Bowman et al., 2015; Williams et al., 2017), with the same preprocessing and splits as Niculae et al. (2018).
Model architecture. We use the decomposable attention model of Parikh et al. (2016) without intra-attention. The model computes a joint attention score matrix $\mathbf{S}$ of size $m \times n$, where $s_{ij}$ depends only on the $i$-th word in the premise and the $j$-th word in the hypothesis (hence decomposable). For each premise word $i$, we apply softmax over the $i$-th row of $\mathbf{S}$ to get a weighted average of the hypothesis. Then, similarly, for each hypothesis word $j$, we apply softmax over the $j$-th column of $\mathbf{S}$, yielding a representation of the premise. From then on, each word embedding is combined with its corresponding weighted context using an affine function, the results are sum-pooled and passed through an output multi-layer perceptron to make a classification. We propose replacing the independent softmax attention with structured, joint attention, normalizing over both rows and columns simultaneously in several different ways, using LPSparseMAP with scores $\boldsymbol{\eta} = \mathbf{S}$. We use frozen GloVe embeddings (Pennington et al., 2014), and all our models have 130k parameters (cf. Appendix E).
Factor graphs. Assume, without loss of generality, $m \leq n$. First, like Niculae et al. (2018), we consider a matching factor, whose relaxed feasible set is

(20)  $\mathcal{M}_{\mathrm{match}} := \Big\{ \mathbf{U} \in [0,1]^{m \times n} : \sum\nolimits_j u_{ij} = 1 \;\;\forall i, \quad \sum\nolimits_i u_{ij} \leq 1 \;\;\forall j \Big\}.$
              |      SNLI      |    MultiNLI
              | valid    test  | valid    test
softmax       | 84.44    84.62 | 70.06    69.42
matching      | 84.57    84.16 | 70.84    70.36
LPmatching    | 84.70    85.04 | 70.57    70.64
LPsequential  | 83.96    83.67 | 71.10    71.17
When $m = n$, linear maximization over this constraint set corresponds to the linear assignment problem, solved by the Kuhn-Munkres (Kuhn, 1955) and Jonker-Volgenant (Jonker and Volgenant, 1987) algorithms, and the solution is a doubly stochastic matrix. When $m \neq n$, the scores can be padded with $-\infty$ to a square matrix prior to invoking the algorithm. A linear maximization thus takes $O(n^3)$, and this instantiation of structured matching attention can be tackled by SparseMAP. Next, we consider a relaxed, equivalent formulation, which we call LPmatching, as shown in Figure 2, with one XOR factor per row and one AtMostOne factor per column:
(21)  $u_{ij} \in [0,1], \qquad \sum\nolimits_j u_{ij} = 1 \;\;\forall i \;(\text{XOR}), \qquad \sum\nolimits_i u_{ij} \leq 1 \;\;\forall j \;(\text{AtMostOne}).$
Each subproblem can be solved in linear time, for a total complexity of $O(mn)$ per iteration (cf. Appendix D). While more iterations may be necessary to converge, the finer-grained approach might make faster progress, yielding more useful latent alignments. Finally, we consider a more expressive joint alignment that encourages continuity. Inspired by the sequential alignment of Niculae et al. (2018), we propose a bidirectional model called LPsequential, consisting of a Sequence factor over the premise, with a possible state for each aligned word in the hypothesis, and a single transition score for every pair of consecutive alignments. By itself, this factor may align multiple premise words to the same hypothesis word, which Niculae et al. (2018) circumvented by running the optimization in both directions independently. Instead, we propose adding AtMostOne factors, as in Equation 21, ensuring each hypothesis word is aligned, on average, to at most one premise word. Effectively, this is like a sequence tagger allowed to use each of the states at most once. For both LPSparseMAP approaches, we rescale the result by row sums to ensure feasibility.
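The MAP oracle of the coarse matching factor described above can be sketched with SciPy's assignment solver (a Jonker-Volgenant-style implementation); the score matrix is an illustrative toy, and note that this routine accepts rectangular inputs directly, playing the role of the padding discussed above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy 3x4 alignment score matrix (premise rows, hypothesis columns).
S = np.array([[3.0, 1.0, 0.0, 0.5],
              [1.0, 4.0, 2.0, 0.0],
              [0.0, 2.0, 5.0, 1.0]])

# MAP for the matching factor: highest-scoring one-to-one alignment.
rows, cols = linear_sum_assignment(S, maximize=True)
total = S[rows, cols].sum()   # score of the best matching
```

Repeated calls to an oracle like this one, on adjusted scores, are all that the SparseMAP active set method needs for the matching-attention variant.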
Results. Table 3 reveals that LPmatching is the best performing mechanism on SNLI, and LPsequential on MultiNLI. The transition score learned by LPsequential is 1.6 on SNLI and 2.5 on MultiNLI, and Figure 4 shows an example of the useful inductive bias it learns. On both datasets, the relaxed LPmatching outperforms the coarse matching factor, suggesting that, indeed, equivalent parametrizations of a model may perform differently when not run until convergence.
5.3 Multilabel classification
Finally, to confirm that LPSparseMAP is also suitable in the supervised setting, we evaluate on the task of multi-label classification. Our factor graph has $k$ binary variables (one for each label), and a pairwise factor with a score for every label co-occurrence:

(22)  $\mathrm{score}(\mathbf{u}) := \sum_{i} \eta_i\, u_i + \sum_{i < j} \eta_{ij}\, u_i u_j.$
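The scoring function in Equation 22 can be sketched directly (an illustrative helper with toy numbers, not the paper's code): unary scores plus one co-occurrence score per unordered label pair.

```python
import numpy as np

def label_score(u, eta_unary, eta_pair):
    """Score of a joint label assignment u in {0,1}^k: unary terms plus a
    pairwise co-occurrence term counted once per label pair."""
    u = np.asarray(u, dtype=float)
    pairwise = np.triu(eta_pair, k=1)     # keep i < j only
    return eta_unary @ u + u @ pairwise @ u

eta_unary = np.array([1.0, -0.5, 0.2])
eta_pair = np.array([[0.0, 2.0, -1.0],   # symmetric co-occurrence scores
                     [2.0, 0.0, 0.5],
                     [-1.0, 0.5, 0.0]])
score = label_score([1, 1, 0], eta_unary, eta_pair)
print(score)  # -> 2.5 (= 1.0 - 0.5 + 2.0)
```

Maximizing this score over all $2^k$ assignments is intractable for large $k$, which is exactly why the experiments rely on LPMAP and LPSparseMAP inference.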
                       | bibtex | bookmarks
Unstructured           | 42.28  | 35.76
Structured hinge loss  | 37.70  | 33.26
LPSparseMAP loss       | 43.43  | 36.07
Neural network parametrization. We use a 2-layer multi-layer perceptron to compute the score for each variable. In the structured models, we have an additional parameter for the co-occurrence score of every pair of classes. We compare an unstructured baseline (using the binary logistic loss for each label), a structured hinge loss (with LPMAP inference), and an LPSparseMAP loss model. We solve LPMAP using AD^{3} and LPSparseMAP with our proposed algorithm (cf. Appendix E).
Results. Table 4 shows the example score on the test set for the bibtex and bookmarks benchmark datasets (Katakis et al., 2008). The structured hinge loss model is worse than the unstructured (binary logistic loss) baseline; the LPSparseMAP loss model outperforms both. This suggests that the LPSparseMAP loss is promising for structured output learning. We note that, in the strictly-supervised setting, approaches that blend inference with learning, such as those of Chen et al. (2015) and Tang et al. (2016), may be more efficient; however, LPSparseMAP also works out of the box as a hidden layer.
6 Related work
Differentiable optimization. The most closely related research direction involves bilevel optimization, or argmin differentiation (Gould et al., 2016). Typically, such research assumes problems are expressible in a standard form, for instance using quadratic programs (Amos and Kolter, 2017) or disciplined convex programs, based on a conic reformulation (Agrawal et al., 2019a, b). Such approaches are not applicable to the typical optimization problems arising in structured prediction, because of the intractably large number of constraints typically necessary, and the difficulty of formulating many problems in standard forms. Our method instead assumes interaction with the problem through local oracle algorithms, exploiting the structure of the factor graph and allowing for more efficient handling of coarse factors (e.g., TREE) and logic constraints.
Latent structure models. Our motivation and applications are mostly focused on learning with latent structure. Specifically, we are interested in global optimization methods, which require marginal inference or similar relaxations (Kim et al., 2017; Liu and Lapata, 2018; Corro and Titov, 2019a, b; Niculae et al., 2018), rather than incremental methods based on policy gradients (Yogatama et al., 2017). Promising methods exist for approximate marginal inference in factor graphs with MAP calls (Belanger et al., 2013; Krishnan et al., 2015; Tang et al., 2016), relying on entropy approximation penalties. Such approaches focus on supervised structured prediction, which is not our main goal, and their backward passes have not been studied, to our knowledge. Importantly, as these penalties are non-quadratic, the active set algorithm does not apply, falling back to more general variants of Frank-Wolfe. The active set method is a key ingredient of our work, as it exhibits fast finite convergence, sparse solutions, and, crucially, precomputation of the matrix inverse required in the backward pass (Niculae et al., 2018). Instead of entropic penalties, the quadratic penalty pioneered by Niculae et al. (2018) is more amenable to optimization, as well as bringing other sparsity benefits. It may be tempting to directly apply SparseMAP with an approximate LPMAP oracle. The projection step of Peng et al. (2018) can be cast as a SparseMAP problem, thus our algorithm can be used to extend their method to arbitrary factor graphs. For pairwise MRFs (a class of factor graphs), differentiating belief propagation, either through unrolling or perturbation-based approximation, has been studied (Stoyanov et al., 2011; Domke, 2013). Our approach instead computes implicit gradients, which is more efficient thanks to quantities precomputed in the forward pass, and in some circumstances has been shown to work better (Rajeswaran et al., 2019).
Finally, none of these approaches can inherently handle logic constraints or coarse factors.
7 Conclusions
We introduced LPSparseMAP, an extension of SparseMAP to sparse differentiable optimization in any factor graph, enabling neural hidden layers with arbitrarily complex structure, specified using a familiar domain-specific language. We have shown LPSparseMAP to outperform SparseMAP for latent structure learning, and its corresponding loss function to outperform the structured hinge loss for structured output learning. We hope that our toolkit empowers future research on latent structure models, improving efficiency for smaller networks through inductive bias.
Supplementary Material
Appendix A Separable reformulation of LPSparseMAP
Lemma 1.
Let $\mathbf{C}$, $\boldsymbol{\delta}$, and $\boldsymbol{\Delta}_f$ be defined as in Proposition 1. Let $\boldsymbol{\Delta} := \mathrm{diag}(\boldsymbol{\delta})$. Then,

(i) $\mathbf{C}^\top \mathbf{C} = \boldsymbol{\Delta}$;

(ii) each row of $\mathbf{C}$ sums to one, i.e., $\mathbf{C}\mathbf{1} = \mathbf{1}$;

(iii) $\boldsymbol{\Delta}^{-1}\mathbf{C}^\top\mathbf{C} = \mathbf{I}$;

(iv) for any feasible pair $(\mathbf{u}, \tilde{\mathbf{u}})$ with $\mathbf{C}\mathbf{u} = \tilde{\mathbf{u}}$, we have $\mathbf{u} = \boldsymbol{\Delta}^{-1}\mathbf{C}^\top\tilde{\mathbf{u}}$, and $\sum_f \|\boldsymbol{\Delta}_f^{-1/2}\mathbf{u}_f\|^2 = \|\mathbf{u}\|^2$.

Proof.
(i) The matrix $\mathbf{C}$, which expresses the agreement constraint $\mathbf{C}\mathbf{u} = \tilde{\mathbf{u}}$, is a stack of selector matrices; in other words, its sub-blocks are either the identity $\mathbf{I}$ or the zero matrix $\mathbf{0}$. We index its rows by pairs $(f, k)$ and its columns by $i \in [D]$. Denote by $f(k) = i$ the fact that the $k$-th variable under factor $f$ is $u_i$. Then, $c_{(f,k),i} = 1$ if $f(k) = i$, else $0$. We can then explicitly compute $(\mathbf{C}^\top\mathbf{C})_{ij} = \sum_{(f,k)} c_{(f,k),i}\, c_{(f,k),j}$.
If $i \neq j$, every term contains a zero, so $(\mathbf{C}^\top\mathbf{C})_{ij} = 0$; the diagonal entries count the factors governing $u_i$, so $(\mathbf{C}^\top\mathbf{C})_{ii} = \delta_i$.
(ii) By construction, each row $(f,k)$ has $c_{(f,k),i} = 1$ for the unique variable $i$ with $f(k) = i$. Thus, $\mathbf{C}\mathbf{1} = \mathbf{1}$.
(iii) It follows from (i) that $\boldsymbol{\Delta}^{-1}\mathbf{C}^\top\mathbf{C} = \boldsymbol{\Delta}^{-1}\boldsymbol{\Delta} = \mathbf{I}$.
(iv) Since $\boldsymbol{\Delta}$ is full-rank, left-multiplying the feasibility condition $\mathbf{C}\mathbf{u} = \tilde{\mathbf{u}}$ by $\boldsymbol{\Delta}^{-1}\mathbf{C}^\top$ yields $\mathbf{u} = \boldsymbol{\Delta}^{-1}\mathbf{C}^\top\tilde{\mathbf{u}}$. Moreover, $\sum_f \|\boldsymbol{\Delta}_f^{-1/2}\mathbf{u}_f\|^2 = \sum_f \mathbf{u}^\top\mathbf{C}_f^\top\boldsymbol{\Delta}_f^{-1}\mathbf{C}_f\mathbf{u} = \|\mathbf{u}\|^2$, since $\sum_f \mathbf{C}_f^\top\boldsymbol{\Delta}_f^{-1}\mathbf{C}_f = \mathbf{I}$: each variable $i$ appears in exactly $\delta_i$ factors, each contributing $1/\delta_i$ to the $i$-th diagonal entry. ∎
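The selector-matrix identities above can be checked numerically on a small assumed factor graph: factor 1 covers variables {1, 2} and factor 2 covers {2, 3}, so variable 2 has degree 2.

```python
import numpy as np

# Stacked selector matrix C: one row per (factor, slot) pair.
C = np.array([[1, 0, 0],    # factor 1, slot 1 -> variable 1
              [0, 1, 0],    # factor 1, slot 2 -> variable 2
              [0, 1, 0],    # factor 2, slot 1 -> variable 2
              [0, 0, 1]])   # factor 2, slot 2 -> variable 3

Delta = C.T @ C             # property (i): diagonal matrix of degrees

u = np.array([0.2, 0.7, 0.1])
u_tilde = C @ u                                  # stacked local copies
u_back = np.linalg.solve(Delta, C.T @ u_tilde)   # property (iv): averaging
```

The degree-weighted averaging `Delta^{-1} C^T` recovers the global vector exactly, which is the operation the consensus step of the ADMM algorithm performs.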
Appendix B Derivation of updates and comparison to LPMAP
Recall the problem from Equation 11, written as a minimization. Parametrizing each $\boldsymbol{\mu}_f = \mathbf{A}_f \mathbf{p}_f$ with $\mathbf{p}_f \in \triangle^{|\mathcal{Y}_f|}$, and letting $\mathbf{M}_f$ denote the unary rows of $\mathbf{A}_f$ (so that $\mathbf{u}_f = \mathbf{M}_f \mathbf{p}_f$), we have

(23)  $\min_{\mathbf{u}, \{\mathbf{p}_f \in \triangle\}} \;\sum_{f \in \mathcal{F}} \Big( -\boldsymbol{\eta}_f^\top \mathbf{A}_f \mathbf{p}_f + \tfrac{1}{2}\big\|\boldsymbol{\Delta}_f^{-1/2}\mathbf{M}_f\mathbf{p}_f\big\|^2 \Big) \quad \text{subject to } \mathbf{M}_f \mathbf{p}_f = \mathbf{C}_f \mathbf{u} \;\;\forall f.$

Since the simplex constraints are separable, we may move them to the objective, yielding

(24)  $\min_{\mathbf{u}, \{\mathbf{p}_f\}} \;\sum_{f \in \mathcal{F}} \Big( -\boldsymbol{\eta}_f^\top \mathbf{A}_f \mathbf{p}_f + \tfrac{1}{2}\big\|\boldsymbol{\Delta}_f^{-1/2}\mathbf{M}_f\mathbf{p}_f\big\|^2 + \iota_{\triangle}(\mathbf{p}_f) \Big) \quad \text{subject to } \mathbf{M}_f \mathbf{p}_f = \mathbf{C}_f \mathbf{u} \;\;\forall f,$

where $\iota_{\triangle}$ is the indicator function of the simplex. The augmented Lagrangian of problem 24 is

(25)  $L_\rho\big(\{\mathbf{p}_f\}, \mathbf{u}, \{\boldsymbol{\lambda}_f\}\big) = \sum_{f \in \mathcal{F}} \Big( -\boldsymbol{\eta}_f^\top \mathbf{A}_f \mathbf{p}_f + \tfrac{1}{2}\big\|\boldsymbol{\Delta}_f^{-1/2}\mathbf{M}_f\mathbf{p}_f\big\|^2 + \iota_{\triangle}(\mathbf{p}_f) + \boldsymbol{\lambda}_f^\top(\mathbf{M}_f\mathbf{p}_f - \mathbf{C}_f\mathbf{u}) + \tfrac{\rho}{2}\big\|\mathbf{M}_f\mathbf{p}_f - \mathbf{C}_f\mathbf{u}\big\|^2 \Big).$

The solution is a saddle point of the Lagrangian, i.e., a solution of

(26)  $\max_{\{\boldsymbol{\lambda}_f\}} \;\min_{\mathbf{u}, \{\mathbf{p}_f\}} \;L_\rho\big(\{\mathbf{p}_f\}, \mathbf{u}, \{\boldsymbol{\lambda}_f\}\big).$
ADMM optimizes Equation 26 in a block-coordinate fashion; we next derive each block update.
b.1 Updating $\mathbf{p}_f$
We update $\mathbf{p}_f$ for each factor independently by solving:

(27)  $\mathbf{p}_f^{(t+1)} := \arg\min_{\mathbf{p}_f \in \triangle} \;-\boldsymbol{\eta}_f^\top \mathbf{A}_f \mathbf{p}_f + \tfrac{1}{2}\big\|\boldsymbol{\Delta}_f^{-1/2}\mathbf{M}_f\mathbf{p}_f\big\|^2 + \boldsymbol{\lambda}_f^{(t)\top}\mathbf{M}_f\mathbf{p}_f + \tfrac{\rho}{2}\big\|\mathbf{M}_f\mathbf{p}_f - \mathbf{C}_f\mathbf{u}^{(t)}\big\|^2.$

Denoting $\mathbf{u}_f = \mathbf{M}_f\mathbf{p}_f$ and splitting $\boldsymbol{\eta}_f = [\boldsymbol{\eta}_{f,u}; \boldsymbol{\eta}_{f,v}]$, we have that $\boldsymbol{\eta}_f^\top\mathbf{A}_f\mathbf{p}_f = \boldsymbol{\eta}_{f,u}^\top\mathbf{u}_f + \boldsymbol{\eta}_{f,v}^\top\mathbf{v}_f$.
The augmented term regularizing the subproblems toward the current estimate of the global solution is $\tfrac{\rho}{2}\|\mathbf{u}_f - \mathbf{C}_f\mathbf{u}^{(t)}\|^2$.
For each factor, the subproblem objective is therefore:

(28)  $-\big(\boldsymbol{\eta}_{f,u} - \boldsymbol{\lambda}_f^{(t)} + \rho\,\mathbf{C}_f\mathbf{u}^{(t)}\big)^\top \mathbf{u}_f - \boldsymbol{\eta}_{f,v}^\top\mathbf{v}_f + \tfrac{1}{2}\,\mathbf{u}_f^\top\big(\boldsymbol{\Delta}_f^{-1} + \rho\,\mathbf{I}\big)\mathbf{u}_f + \mathrm{const}.$

This is exactly a degree-corrected SparseMAP instance with unary scores $\boldsymbol{\eta}_{f,u} - \boldsymbol{\lambda}_f^{(t)} + \rho\,\mathbf{C}_f\mathbf{u}^{(t)}$ and quadratic weight matrix $\boldsymbol{\Delta}_f^{-1} + \rho\,\mathbf{I}$.
Observation. For comparison, when solving LPMAP with AD^{3}, the subproblems minimize the objective

(29)  $-\big(\boldsymbol{\eta}_{f,u} - \boldsymbol{\lambda}_f^{(t)} + \rho\,\mathbf{C}_f\mathbf{u}^{(t)}\big)^\top \mathbf{u}_f - \boldsymbol{\eta}_{f,v}^\top\mathbf{v}_f + \tfrac{\rho}{2}\,\|\mathbf{u}_f\|^2 + \mathrm{const},$

so the update is a SparseMAP instance with the same unary scores but quadratic weight $\rho\,\mathbf{I}$. The notable differences are the scaling by $\boldsymbol{\Delta}_f^{-1} + \rho\,\mathbf{I}$ instead of $\rho\,\mathbf{I}$ (corresponding to the added regularization), and the diagonal degree reweighting.
b.2 Updating $\mathbf{u}$
We must solve

(30)  $\mathbf{u}^{(t+1)} := \arg\min_{\mathbf{u}} \;\sum_{f \in \mathcal{F}} \Big( -\boldsymbol{\lambda}_f^{(t)\top}\mathbf{C}_f\mathbf{u} + \tfrac{\rho}{2}\big\|\mathbf{u}_f^{(t+1)} - \mathbf{C}_f\mathbf{u}\big\|^2 \Big).$

This is an unconstrained problem. Setting the gradient of the objective to $\mathbf{0}$, we get

(31)  $-\mathbf{C}^\top\boldsymbol{\lambda}^{(t)} + \rho\,\boldsymbol{\Delta}\,\mathbf{u} - \rho\,\mathbf{C}^\top\tilde{\mathbf{u}}^{(t+1)} = \mathbf{0},$

with the unique solution $\mathbf{u}^{(t+1)} = \boldsymbol{\Delta}^{-1}\mathbf{C}^\top\big(\tilde{\mathbf{u}}^{(t+1)} + \boldsymbol{\lambda}^{(t)}/\rho\big)$, using $\mathbf{C}^\top\mathbf{C} = \boldsymbol{\Delta}$ from Lemma 1.