# A discriminative view of MRF pre-processing algorithms

## Abstract

While Markov Random Fields (MRFs) are widely used in computer vision, they present a quite challenging inference problem. MRF inference can be accelerated by pre-processing techniques like Dead End Elimination (DEE) [8] or QPBO-based approaches [18, 24, 25] which compute the optimal labeling of a subset of variables. These techniques are guaranteed to never wrongly label a variable but they often leave a large number of variables unlabeled. We address this shortcoming by interpreting pre-processing as a classification problem, which allows us to trade off false positives (i.e., giving a variable an incorrect label) versus false negatives (i.e., failing to label a variable). We describe an efficient discriminative rule that finds optimal solutions for a subset of variables. Our technique provides both per-instance and worst-case guarantees concerning the quality of the solution. Empirical studies were conducted over several benchmark datasets. We obtain a speedup factor of 2 to 12 over expansion moves [4] without preprocessing, and on difficult non-submodular energy functions produce slightly lower energy.

## 1 Pre-processing for MRF inference

We address the inference problem for pairwise Markov Random Fields (MRFs) defined over $n$ variables $x = (x_1, \dots, x_n)$, where each $x_i$ is labeled from a discrete label set $\mathcal{L}$. The MRF can be viewed as a graph $\mathcal{G}$ with a neighborhood system $\mathcal{N}$. To compute the MAP estimate we minimize the energy

$$E(x) = \sum_{i} \theta_i(x_i) + \sum_{(i,j) \in \mathcal{N}} \theta_{ij}(x_i, x_j) \qquad (1)$$

where $\theta_i$ and $\theta_{ij}$ are unary terms and pairwise terms. MRFs are widely used in applications such as image segmentation and stereo [14, 40]. Unfortunately the MRF inference problem is NP-hard even when $|\mathcal{L}| = 2$ (i.e., binary labels) [22].
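The energy above is a sum of unary costs and pairwise costs over neighboring variables. The following minimal Python sketch evaluates such an energy and minimizes it by exhaustive search on a toy chain. The data layout (`unary[i][a]`, `pairwise[(i, j)][a][b]`) and the function names are illustrative assumptions, not the authors' implementation, and the exhaustive search is only meant for tiny examples.

```python
import itertools

def mrf_energy(labels, unary, pairwise):
    """Energy of a full labeling.

    unary[i][a] is the cost of giving variable i label a;
    pairwise[(i, j)][a][b] is the cost of edge (i, j) when x_i = a and x_j = b.
    """
    total = sum(unary[i][a] for i, a in enumerate(labels))
    for (i, j), table in pairwise.items():
        total += table[labels[i]][labels[j]]
    return total

def brute_force_map(unary, pairwise):
    """Exhaustive minimization over all labelings; feasible only for toy problems."""
    label_ranges = [range(len(u)) for u in unary]
    return min(itertools.product(*label_ranges),
               key=lambda x: mrf_energy(x, unary, pairwise))

# A 3-variable chain with binary labels and a Potts pairwise term (hypothetical numbers).
potts = [[0, 1], [1, 0]]
unary = [[0, 2], [1, 1], [3, 0]]
pairwise = {(0, 1): potts, (1, 2): potts}
best = brute_force_map(unary, pairwise)
print(best, mrf_energy(best, unary, pairwise))
```

On this toy chain two labelings attain the minimum energy, which is why the middle variable cannot be determined by any sound pre-processing rule.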

Many algorithms involve some kind of pre-processing phase that seeks to determine the value of a subset of variables, thus reducing the complexity of the remaining combinatorial search problem. Pre-processing methods are commonly used in conjunction with graph cuts, a technique that achieves strong performance on both binary and multilabel MRF inference [40]. Graph cuts handle binary MRFs by reduction to min-cut, which is then solved via max-flow (see [3, 9] for reviews). The most widely used graph cut methods for multi-label MRFs are move-making techniques, which generate a new proposal at each iteration and reduce the multi-label problem to a series of binary subproblems (should each variable stick with the old label or switch to the proposed label?), each of which is then solved by max-flow/min-cut. Popular algorithms in this family include expansion moves [4] and their generalization to fusion moves [27].

The best known pre-processing methods are Dead End Elimination (DEE) [8] and QPBO [3, 21], but there are a number of others [18, 24, 25, 35, 36, 39, 42]. (Similar approaches are used for other NP-hard problems; a prominent example is Davis-Putnam’s pure literal rule for SAT [7].)

The key weakness of such methods is that they are inherently conservative,
since they only label variables whose value can be determined in every global
minimum. Yet the MRFs that occur in computer vision are so large that in
practice we almost never compute the actual global minimum.^{1}

As an example, consider a tiny 8-connected binary MRF with 9 variables (pixels), and suppose we wish to determine by pre-processing that the center pixel should be labeled with 0. In order to soundly compute this by DEE or QPBO, we need to establish that switching the center pixel from 1 to 0 will always decrease the energy, no matter what the configuration of the surrounding pixels. Yet as demonstrated in Table 1, there are local configurations that are quite unlikely.

Table 1, left configuration:

| 1 | 0 | 1 |
|---|---|---|
| 0 | ? | 0 |
| 1 | 0 | 1 |

Table 1, right configuration:

| 1 | 1 | 1 |
|---|---|---|
| 1 | ? | 1 |
| 1 | 1 | 1 |

### 1.1 Outline and contributions

We begin with a summary of related work, with an emphasis on DEE, QPBO and QPBO-based pre-processing techniques. In Section 3 we give our discriminative criterion for pre-processing, motivated by examples like Table 1, and provide efficient approximations for the key subproblems. The theoretical performance of our method is analyzed in Section 4, and experimental results are given in Section 5. Most proofs are deferred to the supplemental material, which also contains additional experimental results.

## 2 Related work

A popular approach to the inference problem is to find the optimal
labeling for a subset of the variables
[8, 13, 18, 20, 34, 35, 36, 39, 44].
A partial labeling that holds in every global minimizer is said to be
persistent [3].
Techniques like QPBO [3, 21] find an optimal partial
labeling by enforcing an even stronger condition: a partial labeling
that will decrease the energy if it is substituted into any complete
labeling.^{2}

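To make these notions precise, we introduce the following notation.
A partial labeling $x_S$ is represented
by the subvector of $x$ indexed by $S$, a subset of the variables. Let $\mathcal{L}_S$ be the label space of $x_S$. Given two partial
labelings $x_S$ and $y_T$ where $S \cap T = \emptyset$, we define $x_S \oplus y_T$ to be the composition of $x_S$ and $y_T$.^{3}

Following [3], we can define persistency and autarky:

###### Definition 1.

A partial labeling $x_S$ is persistent if

$$x_S = x^*_S \quad \forall\, x^* \in \arg\min_{x} E(x). \qquad (2)$$

###### Definition 2.

A partial labeling $x_S$ is an autarky if

$$E(x_S \oplus y_{\bar{S}}) < E(y) \quad \forall\, y \ \text{with}\ y_S \neq x_S, \qquad (3)$$

where $E$ denotes the energy of Eq. 1 and $\bar{S}$ denotes the variables not in $S$.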

Persistency is the key property for pre-processing, since it determines the optimal value of a subset of the variables and thus reduces the remaining combinatorial search problem. In general, though, checking for persistency is intractable [3]. All existing persistency algorithms appear to check the autarky property as a sufficient condition, which states that overwriting an arbitrary labeling with this partial labeling will reduce the energy.
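To illustrate the gap between the two properties, the following Python sketch checks both by exhaustive enumeration for a single variable on a toy problem. It assumes integer costs (so that ties between minimizers are exact), and the function names and data layout are hypothetical, chosen only for this illustration.

```python
import itertools

def energy(x, unary, pairwise):
    """Pairwise MRF energy with unary[i][a] and pairwise[(i, j)][a][b] cost tables."""
    return (sum(unary[i][a] for i, a in enumerate(x))
            + sum(t[x[i]][x[j]] for (i, j), t in pairwise.items()))

def all_labelings(unary):
    return itertools.product(*[range(len(u)) for u in unary])

def is_persistent(i, a, unary, pairwise):
    """True if x_i = a holds in every global minimizer."""
    energies = {x: energy(x, unary, pairwise) for x in all_labelings(unary)}
    best = min(energies.values())
    return all(x[i] == a for x, e in energies.items() if e == best)

def is_autarky(i, a, unary, pairwise):
    """True if overwriting x_i with a strictly lowers the energy of
    every labeling that has x_i != a."""
    for y in all_labelings(unary):
        if y[i] == a:
            continue
        z = y[:i] + (a,) + y[i + 1:]
        if not energy(z, unary, pairwise) < energy(y, unary, pairwise):
            return False
    return True

# Two binary variables; the unique minimizer is (0, 0).
unary = [[0, 1], [0, 5]]
pairwise = {(0, 1): [[0, 2], [2, 0]]}
print(is_persistent(0, 0, unary, pairwise), is_autarky(0, 0, unary, pairwise))
```

In this example `x_0 = 0` is persistent but not an autarky: starting from the (non-optimal) labeling `(1, 1)`, switching `x_0` to 0 raises the energy. A test based on autarky therefore leaves this variable unlabeled.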

### 2.1 MRF pre-processing algorithms

QPBO generalizes the binary graph cut reduction that uses max-flow
to find an optimal partial labeling
[3, 21, 33]. If the energy function is
submodular^{4}

There are also techniques that directly find optimal partial labelings for the multi-label case, but the computational costs of these methods are significant. Kovtun [24, 25] described an approach that constructs a series of binary one-versus-the-rest auxiliary problems and solves each of them via graph cuts. MQPBO [18] and generalized roof duality [44] proposed generalizations of QPBO to multi-label MRFs.

Recently, Swoboda et al. [39] used standard MRF inference algorithms to iteratively update the set of persistent variables. Shekhovtsov [35] formulated the problem of maximizing the number of optimally labeled variables as an LP. The two approaches were also combined to take advantage of both [36]. These approaches label significantly more variables than Kovtun’s approach and MQPBO. However, their running time is significantly longer, since they iteratively solve expensive optimization problems (either via a standard MRF inference solver or an LP solver).

Dead End Elimination (DEE) [8] and the recent Persistency Relaxation (PR) algorithm [42] are the only existing methods with cheaper computational costs than max-flow. DEE checks a local sufficient condition which only involves a single vertex and its adjacent edges. PR generalizes DEE to check a larger partial labeling, which gives improved results on standard benchmarks.
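A local test of the kind DEE performs can be sketched as follows. This is a Goldstein-style dominance check written under our own assumptions: it may differ in detail from the variant in [8], and the helper names and data layout are hypothetical. Label `a` is fixed at variable `i` only if it strictly beats every other label even when each neighbor independently takes its least favorable value.

```python
def edge_cost(pairwise, i, a, j, b):
    """Cost of the edge between i and j when x_i = a and x_j = b.
    Each undirected edge is stored once, under one key ordering."""
    if (i, j) in pairwise:
        return pairwise[(i, j)][a][b]
    return pairwise[(j, i)][b][a]

def dee_fixes_label(i, a, unary, pairwise, neighbors):
    """Sufficient (sound but conservative) local test that x_i = a."""
    for b in range(len(unary[i])):
        if b == a:
            continue
        # Worst-case energy gain of choosing a instead of b.
        gain = unary[i][b] - unary[i][a]
        for j in neighbors[i]:
            gain += min(edge_cost(pairwise, i, b, j, l) - edge_cost(pairwise, i, a, j, l)
                        for l in range(len(unary[j])))
        if gain <= 0:
            return False
    return True

potts = [[0, 1], [1, 0]]
unary = [[0, 2], [1, 1], [3, 0]]
pairwise = {(0, 1): potts, (1, 2): potts}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
print([dee_fixes_label(i, a, unary, pairwise, neighbors)
       for i, a in [(0, 0), (1, 0), (2, 1)]])
```

The test only touches one vertex and its incident edges, which is what makes it cheaper than max-flow.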

## 3 Discriminative pre-processing of MRFs

In computer vision, the MRF inference problem is almost never solved exactly. As a result, pre-processing methods that enforce soundness are far too conservative, since they leave a large number of variables unlabeled. If we view pre-processing as a binary classification problem (given a partial labeling , decide if it’s persistent), existing techniques ensure that there are no false positives (i.e., variables given a label must be part of every global minimum), but at the cost of multiple false negatives (i.e., variables that are left unlabeled).

First, we need some notation. Define

to be the energy change when we substitute by given the partial labeling for the variables not in . By expanding the definition of and cancelling terms, the Markov property of MRFs gives us a sum over terms only depending on , for and for with some such that (i.e., is adjacent to ). Let , and we can rewrite .

This allows us to rewrite the autarky property (3) as:

(4) |

The key issue is the universal quantification in Eq. 4. To ensure that a partial labeling is present in every global minimizer, we look at all possible values that the neighbors might have. For each of these, we check that any other assignment would increase the energy.

Yet this is obviously quite conservative. We now show that the desired persistency property can be rewritten by only looking at assignments to the neighboring variables that occur in a global minimizer. Define to be the set of all possible configurations of in a global minimizer.

###### Lemma 3.

is persistent if and only if

(5) |

###### Proof.

The if direction is trivial: consider an arbitrary global minimizer , we have by definition. Suppose , we will have , which contradicts the assumption that is a minimizer. Therefore, we have , so it is persistent.

For the only if direction, suppose Eq. 5 is not true, then such that . We can expand to one minimizer such that . Since is persistent, we also know . Therefore, . Since is a minimum this inequality is an equality, hence is also a global minimum. This contradicts the assumption that is persistent, since . ∎

### 3.1 Discriminative criterion

Comparing Eq. 4 and Eq. 5, we immediately observe that the universal quantifier makes autarky a sound but stronger condition than persistency. Crucially, this suggests a discriminative criterion to trade off false positives against false negatives.

The high level idea is the following. Let be the set of neighbor configurations such that given them is always a better choice. When is large enough or covers the most important neighbor configurations, it’s very likely that we will have . This in turn implies is persistent, even though and we do not precisely know .

Formally, assume we have a ground truth distribution which is uniform over and otherwise. Then a sound condition to check persistency is . Of course, computing and is computationally intractable. So we use an estimated distribution that approximates . Looking back to Table 1, one would assume that the left configuration would not appear in , while the right one quite plausibly could; there should be a lower value for the left one but a higher value for the right. Our discriminative criterion for persistency is

(6) |

Here is the key parameter that controls the tradeoff between false positives and false negatives, as shown by the following (obvious) lemma.

###### Lemma 4.

For the same set of decision problems for persistency, we will never increase the number of false positives by increasing .

We now address the two crucial issues: how to choose to effectively approximate , and how to efficiently check Eq. 6.

### 3.2 Approximating

A trivial baseline is to treat each as equally important and set our approximation to be the uniform distribution over . In this special case, checking Eq. 6 is equivalent to counting the number of neighbor configurations that satisfy . We expect to cover the unknown with high probability when is large enough.
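For a single variable, the uniform baseline amounts to counting neighbor configurations. The following Python sketch does this by brute force. The representation is an assumption made for illustration: `unary_i[b]` is the unary cost of label `b`, and `edge_tables[k][b][l]` is the pairwise cost when the variable takes label `b` and its `k`-th neighbor takes label `l`.

```python
import itertools

def uniform_criterion(a, unary_i, edge_tables):
    """Fraction of neighbor configurations under which label a strictly
    beats every other label of the variable (uniform weighting)."""
    num_labels = len(unary_i)
    ranges = [range(len(table[0])) for table in edge_tables]
    good = total = 0
    for z in itertools.product(*ranges):
        def local(b):
            # Energy terms that involve this variable, given neighbor labels z.
            return unary_i[b] + sum(t[b][l] for t, l in zip(edge_tables, z))
        total += 1
        if all(local(b) > local(a) for b in range(num_labels) if b != a):
            good += 1
    return good / total

potts = [[0, 1], [1, 0]]
print(uniform_criterion(0, [0, 1], [potts, potts]))
```

With two Potts neighbors and a unary preference of 1 for label 0, label 0 wins in three of the four neighbor configurations, so a threshold of 0.7 would accept it while a sound rule (threshold 1.0) would not.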

A more elegant approach is to estimate the marginal probability of a particular assignment via the generative MRF model, and use this as our approximation for . This problem is well studied in the message passing literature, and is often solved by max-product loopy belief propagation (LBP) [31, 43]. An important special case is if we only use the initialization of LBP, . This makes a certain amount of intuitive sense: in the MRF energy functions that occur in computer vision it is well known that most of the weight comes from the unary terms [31], which provide a strong signal as to the optimal label for each variable.
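One plausible way to obtain such unary-only marginals is sketched below. It assumes the usual Gibbs form, in which the probability of a label is proportional to the exponential of its negative unary cost; this specific form is our assumption rather than a formula taken from the text, and the function names are hypothetical.

```python
import math

def unary_marginals(unary_j):
    """Per-variable label distribution from unary costs alone,
    assuming p(l) is proportional to exp(-cost(l))."""
    weights = [math.exp(-c) for c in unary_j]
    z = sum(weights)
    return [w / z for w in weights]

def config_probability(z, marginals):
    """Probability of a neighbor configuration z under a fully
    independent (factorized) distribution."""
    p = 1.0
    for m, l in zip(marginals, z):
        p *= m[l]
    return p

marginals = [unary_marginals([0.0, 0.0]), unary_marginals([0.0, math.log(3)])]
print(config_probability((0, 1), marginals))
```

Because the distribution factorizes over neighbors, the weight of a configuration is a simple product, which is what makes the bound of Section 3.3 cheap to evaluate.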

More generally, we can define to be a fully independent distribution with , where is the message we have from the belief propagation algorithm. Since this is just an approximation, we would not need to pay the cost of running LBP to convergence. In our experiments, the more general approach does not seem to pay dividends, but other ways of estimating the marginals are worth investigating.

### 3.3 Efficiently checking our discriminative criterion

Checking Eq. 6 is generally computationally intractable, due to the size of and . We now propose a polynomial time algorithm to compute a lower bound for .

We will focus on the persistency of a single variable from this point forward. This subroutine is used by our construction algorithm (which will be described in Section 3.4) to construct a partial labeling for the given energy function . However, our methods can handle an arbitrary for ; the details are deferred to the supplementary material, but are similar to the single variable case.

Our general strategy is to find a subset of which we know is inside and can be easily factorized. We start by considering each node independently. For each , define to be the set of labels where the autarky condition holds if . Since autarky is a stronger condition than persistency, we know that all values where are inside . The union of these sets across different will still be a subset of .

Formally, define . Then . Let . Then, we know that and .

We establish a computationally tractable lower bound for by the following lemma, which we can check instead.

###### Lemma 5.

We have the following lower bound:

(7) |

where .

###### Proof.

We can view as the probability given distribution .

Because our can be factorized independently, we can integrate over the variables other than to get .

We also have since are all disjoint. Then, using independence again, we have

(8) | ||||

Finally, note that we argued before, which concludes the proof. ∎

Constructing requires us to be able to efficiently check . We expand it by the definition of then swap the min and sum operators. This gives the following lower bound, which we check for being strictly positive:

(9) | ||||
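The strategy of this section can be sketched in Python for a single variable as follows. For each neighbor and each of its labels, we test with the min and sum swapped whether label `a` wins regardless of the other neighbors; the probability that at least one neighbor takes such a label is then a lower bound on the criterion. Under a fully independent distribution this probability equals one minus the product of the per-neighbor miss probabilities, which may be written differently from the expression in Lemma 5. The data layout and names are assumptions for illustration.

```python
def factorized_lower_bound(a, unary_i, edge_tables, marginals):
    """Lower bound on the probability mass of neighbor configurations
    under which label a strictly beats every other label.

    edge_tables[k][b][l]: pairwise cost for label b and neighbor k's label l.
    marginals[k][l]: assumed independent probability that neighbor k has label l.
    """
    others = [b for b in range(len(unary_i)) if b != a]
    num_nb = len(edge_tables)

    def diff(k, b, l):
        return edge_tables[k][b][l] - edge_tables[k][a][l]

    # worst[k][b]: least favorable contribution of neighbor k to "a beats b".
    worst = [{b: min(diff(k, b, l) for l in range(len(marginals[k])))
              for b in others} for k in range(num_nb)]

    miss = 1.0
    for k in range(num_nb):
        q = 0.0
        for l in range(len(marginals[k])):
            # Sufficient test: with neighbor k pinned to l, a beats every b
            # whatever the remaining neighbors do (min moved inside the sum).
            wins = all(unary_i[b] - unary_i[a] + diff(k, b, l)
                       + sum(worst[m][b] for m in range(num_nb) if m != k) > 0
                       for b in others)
            if wins:
                q += marginals[k][l]
        miss *= 1.0 - q
    return 1.0 - miss

potts = [[0, 1], [1, 0]]
uniform = [0.5, 0.5]
print(factorized_lower_bound(0, [0, 1], [potts, potts], [uniform, uniform]))
```

With two Potts neighbors the bound is tight in this example; with three it drops to zero although half of the configurations favor label 0, which shows that the bound is sound but can be conservative.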

### 3.4 Our algorithm

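Algorithm 7

Input: energy function

1. Initialize the set of labeled variables to be empty
2. for each pre-processing iteration do
3. for each unlabeled variable and each label in its label set do
4. Compute the discriminative criterion (or its lower bound)
5. if the criterion passes the test then
6. fix the variable to this label
7. add it to the partial labeling found so far
8. end if
9. end for
10. end for
11. With the partial labeling fixed, use one MRF inference algorithm to solve the remaining variables, get a labeling of the remaining variables
12. return the concatenation of the two labelings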

We have presented our discriminative criterion to decide if a given partial labeling is persistent. We now use it as the key subroutine for pre-processing in MRF inference, as shown in lines 2-10 of Algorithm 7. We first loop over the unlabeled variables and their label sets (line 3). For each given , we use our discriminative rule to judge whether it is persistent (lines 4-8). If it passes our test, we fix its value by setting , and concatenate it with our inference result (lines 6 and 7). Note that fixing also provides additional information about the unlabeled variables that were checked before , so we repeat the whole procedure for iterations (line 2).

After our pre-processing has terminated and labeled the variables in the set , we fix these variables and use any MRF inference algorithm to solve the remaining energy minimization problem, which gives us a labeling on the remaining variables (line 11). Finally, we obtain our inference result by concatenating them together (line 12).
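The overall procedure can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the criterion is evaluated by brute force with uniform weights, the remaining variables are solved by exhaustive search rather than by expansion moves, and `kappa` is our name for the threshold parameter of Eq. 6.

```python
import itertools

def edge_cost(pairwise, i, a, j, b):
    """Cost of the edge between i and j when x_i = a and x_j = b."""
    if (i, j) in pairwise:
        return pairwise[(i, j)][a][b]
    return pairwise[(j, i)][b][a]

def criterion(i, a, unary, pairwise, neighbors, fixed):
    """Fraction of neighbor configurations under which label a strictly beats
    every other label of variable i; already-fixed neighbors keep their label."""
    choices = [[fixed[j]] if j in fixed else range(len(unary[j]))
               for j in neighbors[i]]
    good = total = 0
    for z in itertools.product(*choices):
        def local(b):
            return unary[i][b] + sum(edge_cost(pairwise, i, b, j, l)
                                     for j, l in zip(neighbors[i], z))
        total += 1
        good += all(local(b) > local(a) for b in range(len(unary[i])) if b != a)
    return good / total

def preprocess(unary, pairwise, neighbors, kappa, iterations=2):
    """Repeatedly sweep over unlabeled variables and fix those passing the test."""
    fixed = {}
    for _ in range(iterations):
        for i in range(len(unary)):
            if i in fixed:
                continue
            for a in range(len(unary[i])):
                if criterion(i, a, unary, pairwise, neighbors, fixed) >= kappa:
                    fixed[i] = a
                    break
    return fixed

def energy(x, unary, pairwise):
    return (sum(unary[i][a] for i, a in enumerate(x))
            + sum(t[x[i]][x[j]] for (i, j), t in pairwise.items()))

def infer(unary, pairwise, neighbors, kappa, iterations=2):
    """Pre-process, then solve the remaining variables exhaustively."""
    fixed = preprocess(unary, pairwise, neighbors, kappa, iterations)
    free = [i for i in range(len(unary)) if i not in fixed]
    best = None
    for values in itertools.product(*[range(len(unary[i])) for i in free]):
        x = [fixed.get(i) for i in range(len(unary))]
        for i, v in zip(free, values):
            x[i] = v
        if best is None or energy(x, unary, pairwise) < energy(best, unary, pairwise):
            best = x
    return best, fixed
```

With `kappa = 1.0` the rule only fixes labels that win under every neighbor configuration; lowering `kappa` fixes more variables at the risk of fixing some incorrectly.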

Running time We now give an asymptotic bound on the running time of our pre-processing algorithm, deferring the analysis to the supplementary material. Assume we have an oracle that gives us data terms and prior terms in time. Let and be the number of variables, edges and maximum possible labels, and be the maximum degree of the graph. The overall running time is when we use Section 3.3 to check our discriminative criterion, and for brute force (which is feasible when both and are small constants).

## 4 Performance bounds

We can analyze the per-instance and worst-case performance of our pre-processing methods when followed by an inference algorithm that produces a solution with performance bounds.

### 4.1 Per-instance bounds

There are a number of MRF inference algorithms that produce per-instance guarantees (i.e., they produce a certificate after execution that their solution is close to the global minimum). These methods, which are typically based on linear programming, include [19, 23, 41], and they provide a per-instance additive error bound by computing the duality gap.

Our algorithm has a natural way to bound additive errors. Recall our notation describing the energy changes when we flip to with the neighbor configuration . Therefore, is the worst case energy decrement when we flip to arbitrary with arbitrary neighbor configurations . It’s non-positive since we can always set . Now we can negate it and define to be the maximum potential energy loss when we use our discriminative criterion to decide is persistent. Then we have the following two lemmas.

###### Lemma 6.

Let be the persistent variables found by our Algorithm 7. For arbitrary , and arbitrary , we have .

###### Proof.

With fixed, we flip to in the reverse order of them being added to by our algorithm. Due to the analysis before, we will lose at most at each step. ∎

###### Theorem 7.

Suppose the inference algorithm has per-instance -additive bound, then .

###### Proof.

Let be the minimizer of with fixed, which might be different than the global minimizer . Then we will have . The first step is because we use an inference algorithm with -additive errors to solve the problem with fixed. The second step follows because is the minimizer w.r.t. . ∎

As a special case, any sound condition like Eq. 4 guarantees , i.e., we don’t make mistakes. In practice it is computationally intractable to compute , so just as in Section 3.3 we swap the min and sum operators, and compute the upper bound efficiently. Then we use as our per-instance additive bound.

### 4.2 Worst case bounds

Some MRF inference algorithms produce a solution that is guaranteed to lie within a known factor of the global minimum. The best known such technique is the expansion move algorithm [4] but there are others [11, 17, 23]. We can easily turn our per-instance bounds into the worst case bounds. We combine Eq. 6 and as our discriminative criterion for pre-defined .

###### Corollary 8.

Suppose the inference algorithm has worst case -additive bound, then is our worst case additive bound.

Inference algorithms with worst case guarantees usually provide multiplicative rather than additive bounds, but we can modify our proof of Theorem 7 to get the following bounds.

###### Theorem 9.

Suppose the inference algorithm has a worst case -multiplicative bound, then we will have .

###### Proof.

Following the proof of Theorem 7, we have . ∎

A more careful analysis can give us a tighter bound (dropping the coefficient before ), for the important special case where we use the expansion move algorithm [4] for inference. We defer the proof to the supplementary material.

###### Theorem 10.

Suppose we use expansion moves as the inference algorithm, with the -multiplicative bound, then we will have .

## 5 Experiments

### 5.1 Datasets and experimental setup

Approaches The most natural baselines for us to compare against include inference without pre-processing, and inference using the sound (but conservative) DEE [8] and PR [42] techniques. We employ expansion moves for MRF inference [4]. In order to achieve better speedup, we apply preprocessing to each induced binary subproblem of expansion moves as the input of DEE, PR or Algorithm 7, and then run QPBO [21].

At the other end of the spectrum are high overhead techniques such as Kovtun’s approach [24, 25], MQPBO [18], and LP-based approaches [35, 36, 39]. These algorithms require more running time than max-flow on each induced binary subproblem. Therefore, we apply them to the multi-label problem, and then use expansion moves to infer the remaining part. We choose the IRI method [36] as the representative among [35, 36, 39] since it’s significantly faster. Note that the Reduce, Reuse and Recycle method [2] also uses Kovtun’s method as its pre-processing (reduce) step in order to speed up MRF inference. The reuse and recycle parts attempt to speed up the inference algorithm itself, which is orthogonal to what we propose in this paper, so we do not compare against this method.

We also compared against other widely used MRF inference algorithms besides expansion moves, including loopy belief propagation (LBP) [31, 43], dual decomposition (DD) [15], TRWS [19] and MPLP [10, 37, 38]. Comparisons among these inference algorithms are provided in survey papers [14, 40]. In our experiments, expansion moves are usually significantly faster than the other methods, and give comparable or better energy. These experimental comparisons are deferred to the supplemental material.

Dataset We conducted experiments on a variety of MRF inference benchmarks, where the energy minimization problems come from different vision problems, including color segmentation [26], stereo, image inpainting, denoising [40] and optical flow [6]. Datasets for the first three tasks are wrapped in OpenGM2 [14] and are available online. We use the BSDS300 [28] for the denoising task with the MRF setup following [40]. We use the MPI Sintel dataset [5] for the optical flow task with the MRF setup following [6].

Our focus, of course, is on the difficult inference problems where the induced binary subproblem is non-submodular. For comparison, we also included some experiments on relatively easy problems where the induced binary subproblem is submodular.

Measurement
We report the improvement in overall running time (including both
pre-processing and the inference for the remaining unlabeled variables) and
relative energy change.^{5}

We also report the percentage of labeled variables during the pre-processing. Since we view the decision problem (whether a given partial labeling is persistent) as a classification problem, we use the terms percentage of labeled variables and recall value interchangeably. Getting the precision value is tricky. Since the problem is NP-hard, we cannot obtain the ground truth label for every variable. However, we apply our pre-processing technique to the binary subproblems induced from expansion moves. We know that either max-flow solves the subproblem exactly for the submodular cases or QPBO can find a persistent partial labeling for a sufficiently large subset of the variables in the non-submodular cases (in our experiments, it labels almost all the variables). Therefore, we report the precision value of our method on the subset of the variables where we know the ground truth labeling.
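The two reported quantities can be computed as in the following Python sketch. The function name and the dictionary-based representation are assumptions for illustration: `fixed` maps each variable labeled by pre-processing to its label, and `ground_truth` contains only the variables whose optimal label is known (for example from max-flow or QPBO on the binary subproblem).

```python
def precision_and_labeled_fraction(fixed, ground_truth, num_variables):
    """Return (precision, fraction of labeled variables).

    Precision is measured only on labeled variables whose ground truth is
    known; it is None when there is no such variable.
    """
    labeled_fraction = len(fixed) / num_variables
    checked = [i for i in fixed if i in ground_truth]
    if not checked:
        return None, labeled_fraction
    correct = sum(fixed[i] == ground_truth[i] for i in checked)
    return correct / len(checked), labeled_fraction

fixed = {0: 1, 1: 0, 2: 1, 4: 0}
ground_truth = {0: 1, 1: 1, 2: 1, 3: 0}
print(precision_and_labeled_fraction(fixed, ground_truth, 5))
```

Variable 4 is labeled but has no known ground truth, so it counts toward the labeled fraction and is excluded from the precision.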

Parameter setup and sensitivity analysis
The discriminative rule in our approach has a few parameters.
In order to achieve a fair comparison, we employed
leave-one-out cross-validation (see, e.g., [31]): for each instance, all the other instances in
the same dataset serve as the validation set for choosing the best parameters.^{6}
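The selection protocol can be sketched in Python as follows. The score table is hypothetical: `scores[instance][param]` stands for any quality measure where higher is better (for example, speedup subject to an energy constraint), and the instance names and numbers below are made up for illustration.

```python
def leave_one_out_choice(scores):
    """For each held-out instance, pick the parameter with the best mean
    score over all the other instances of the same dataset."""
    chosen = {}
    for held_out in scores:
        others = [s for name, s in scores.items() if name != held_out]
        params = others[0].keys()
        chosen[held_out] = max(
            params, key=lambda p: sum(s[p] for s in others) / len(others))
    return chosen

scores = {
    "a": {0.7: 1.0, 0.8: 3.0},
    "b": {0.7: 2.0, 0.8: 2.5},
    "c": {0.7: 5.0, 0.8: 1.0},
}
print(leave_one_out_choice(scores))
```

The parameter used on an instance is never tuned on that instance itself, which is what makes the comparison fair.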

There is evidence that our approach achieves good performance over a wide range of parameters. We observed that cross validation picked nearly identical parameters for every instance in the same dataset. Using nearby parameters also produced good results.

We also experimented with the following fixed parameters, to avoid the expense of cross-validation: , using the uniform distribution for and checking with Section 3.3. Note that this is a fairly conservative assumption, since we use the exact same parameters for very different energy functions, but still obtain good results. We achieve a 2x-12x speedup on different datasets with the energy increasing in the worst case. In addition, we still get lower energy on 4 of the 5 challenging datasets. We defer the details of our fixed parameter experiments to the supplementary material.

| Dataset | Measurement | Ours | DEE | PR | Kovtun | MQPBO | IRI |
|---|---|---|---|---|---|---|---|
| Challenging datasets (non-Potts energy, large label set) | | | | | | | |
| Stereo (12–20 labels, Trunc. L1/L2) | Speedup | 1.78x | 1.06x | 1.13x | N/A | MEM | 0.51x |
| | Energy Change | -0.06% | 0.00% | 0.00% | N/A | MEM | -0.15% |
| | Labeled Vars | 44.76% | 10.07% | 18.06% | N/A | MEM | 56.45% |
| Inpainting (256 labels, Trunc. L2) | Speedup | 3.40x | 1.28x | 1.32x | N/A | MEM | 0.12x |
| | Energy Change | -1.71% | 0.00% | 0.00% | N/A | MEM | 0.00% |
| | Labeled Vars | 74.29% | 21.05% | 23.75% | N/A | MEM | 0.36% |
| Denoising-sq (256 labels, L2) | Speedup | 11.83x | 1.20x | 1.37x | N/A | MEM | 0.29x |
| | Energy Change | -0.02% | 0.00% | 0.00% | N/A | MEM | 0.00% |
| | Labeled Vars | 97.91% | 16.54% | 29.83% | N/A | MEM | 0.39% |
| Denoising-ts (256 labels, Trunc. L2) | Speedup | 11.91x | 10.53x | 10.64x | N/A | MEM | 0.18x |
| | Energy Change | 0.00% | 0.00% | 0.00% | N/A | MEM | -0.03% |
| | Labeled Vars | 98.32% | 95.65% | 97.69% | N/A | MEM | 5.85% |
| Optical Flow (225 labels, L1) | Speedup | 4.69x | 2.63x | 3.40x | N/A | MEM | TO |
| | Energy Change | -0.04% | 0.00% | 0.00% | N/A | MEM | TO |
| | Labeled Vars | 77.25% | 54.34% | 65.51% | N/A | MEM | TO |
| Easy datasets (Potts, small label set) | | | | | | | |
| Color-seg-n4 (4–12 labels, Potts) | Speedup | 7.02x | 4.55x | 6.34x | 2.43x | 0.37x | 3.67x |
| | Energy Change | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | -0.12% |
| | Labeled Vars | 85.74% | 65.38% | 77.50% | 70.32% | 17.27% | 98.44% |
| Color-seg-n8 (4–12 labels, Potts) | Speedup | 8.33x | 5.61x | 6.37x | 2.33x | 0.32x | 1.45x |
| | Energy Change | +0.04% | 0.00% | 0.00% | 0.00% | 0.00% | -0.10% |
| | Labeled Vars | 90.39% | 71.62% | 82.05% | 70.05% | 17.87% | 99.35% |

| Dataset | Stereo | Inpainting | Denoising-sq | Denoising-ts | Optical Flow | Color-seg-n4 | Color-seg-n8 |
|---|---|---|---|---|---|---|---|
| Precision | 99.74% | 96.16% | 99.95% | 99.79% | 99.88% | 99.79% | 99.77% |

| Dataset | P / R | 0.7 | 0.8 | 0.9 | 1.0 |
|---|---|---|---|---|---|
| Stereo | P | 90.40% | 99.71% | 99.41% | 100.00% |
| | R | 91.31% | 56.77% | 11.35% | 9.26% |
| Inpainting | P | 95.11% | 99.88% | 99.96% | 100.00% |
| | R | 90.51% | 47.06% | 25.97% | 21.93% |
| Denoising-sq | P | 99.66% | 99.95% | 99.95% | 100.00% |
| | R | 99.47% | 97.52% | 19.11% | 15.15% |
| Denoising-ts | P | 99.75% | 99.95% | 99.99% | 100.00% |
| | R | 98.61% | 96.65% | 94.99% | 94.62% |
| Optical Flow | P | 94.01% | 99.50% | 99.98% | 100.00% |
| | R | 99.27% | 93.74% | 60.85% | 56.79% |
| Color-seg-n4 | P | 94.77% | 99.50% | 99.86% | 100.00% |
| | R | 98.52% | 90.80% | 77.20% | 66.65% |
| Color-seg-n8 | P | 99.48% | 99.76% | 99.87% | 100.00% |
| | R | 92.84% | 90.43% | 86.92% | 71.66% |

### 5.2 Results on benchmarks

We summarize our experimental results in Table 8. Our
primary goal is to speed up MRF inference on hard problems, and there is
evidence that our benchmarks are challenging. The state-of-the-art IRI
method, which delivers impressive performance on the easier problems in our
benchmarks, struggles with the harder problems.^{7}

Our approach achieved a significant improvement, making expansion moves 2x to 12x faster on various datasets. Our pre-processing method beats its natural competitor DEE by around 2x, and outperforms all the baseline methods. Figure 1 shows a typical energy vs. time curve. We can see our approach drives the energy curve down much faster than the other methods.

The key factor for the speedup is the percentage of labeled variables. The values of these variables are fixed during the pre-processing step, resulting in a smaller problem for max-flow/QPBO to solve. Table 8 shows our approach labels significantly more variables than DEE and PR, especially on the inpainting and denoising-sq datasets. Kovtun, MQPBO and IRI have very expensive overhead as the pre-processing step. While it is impressive that IRI labels almost every variable on the easy dataset, it is still 2x-6x slower than our proposed method. Furthermore, Kovtun, MQPBO and IRI do not perform well on our challenging datasets. When the size of the label set is large (which is common in many vision problems such as inpainting, denoising or optical flow), even IRI only proves a few variables to be persistent after spending 3x-70x as much time as our method. This demonstrates the advantage of performing pre-processing on the binary subproblem, which is consistent with the observation in [42].

Our method also performs well in terms of energy, especially on the hard benchmarks. Because we can label some variables incorrectly during pre-processing, there is a risk of producing a larger energy. However, the experimental results are reassuring: on the hard problems we actually produce slightly lower energy, while on the easier problems we can produce slightly higher energy.

While it is somewhat counter-intuitive, occasionally labeling variables incorrectly can plausibly lead to a better overall energy by getting out of a local minimum. Expansion moves can be viewed as a local search algorithm, although its search space has exponential size [1]. Therefore, a random walk that occasionally goes uphill may help us escape from a local minimizer, as in the Metropolis algorithm [30] or simulated annealing [16]. At one iteration of the expansion move algorithm, our method may label some variables incorrectly and solve the binary subproblem suboptimally (i.e., our pre-processing may cause the energy to increase within the expansion move framework). It is plausible that this suboptimal move for the binary subproblem may also help us escape from the local minimizer. To verify this hypothesis, we experimented with a variant of our method where we reject an expansion move if it makes the energy worse. In our experiments, this change led to a worse final energy. This suggests that allowing suboptimal moves is beneficial.

We believe that our method achieves competitive energy due to the very high precision, shown in Table 3. It demonstrates that our discriminative rule described in Eq. 6 is effective and powerful despite being simple and intuitive. In general, by compromising precision a little bit, we can significantly boost the recall value, as illustrated in Table 4.

In summary, our proposed method achieves a high quality trade-off between running time and energy among all the methods, particularly on challenging datasets. It runs significantly faster than its competitors and achieves an energy that is similar and sometimes even lower.

### 5.3 Experiments with parameters and bounds

In Section 5.2, we set the parameter , and investigated how our algorithm performed without the worst case bound. We have demonstrated the proposed discriminative rule Eq. 6 itself is empirically effective. All the post-running per-instance bounds we proved in Section 4.1 are still sound, although in this variant of our method there is no worst case theoretical guarantee.

However, if we combine Eq. 6 and as our decision rule, as described in Section 4.2, we will have the worst case bounds. We also conducted experiments with different ’s. Our observation is that when , adding the rule has minimal effects on the speedup and energy we reported in Table 8, since our precision is already very high as shown in Table 4. However, it gives us a worst case theoretical guarantee. When , we can observe a noticeable improvement on precision and energy change when we decrease the value with other parameters fixed. As a special case, we have a sound condition again when . In general, decreasing increases the running time, but the tradeoffs involved are not obvious, and we defer details to the supplementary material.

Acknowledgments: This research was supported by NSF grants IIS-1447473 and IIS-1161860 and by a Google Faculty Research Award.

## 6 Outline of supplementary material

We give a detailed running time analysis of our proposed algorithm in Section 7. We then give the proofs of Lemma 4 and Theorem 10 in Section 8 and Section 9, respectively. The generalization of the efficient discriminative criterion check subroutine is described in Section 10. More implementation details are given in Section 11. Finally, we provide more experimental data in Section 12, including visualization results, experimental results on a typical parameter setup, further investigation of parameter sensitivity, the role of the worst case bound in practice, and preliminary results on multilabel MRFs.

## 7 Running time analysis

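Algorithm 7. Input: the energy function. An outer for loop wraps an inner for loop; each inner iteration computes the quantity needed by the decision rule and, if the test in the if statement passes, fixes the corresponding variable. With the fixed variables held constant, use one MRF inference algorithm to solve for the remaining variables, and return the resulting labeling.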

The pseudo-code of our proposed algorithm is listed in Algorithm 7. It is identical to the pseudo-code in the main paper.

We give an asymptotic analysis of the running time of our pre-processing algorithm here. Assume we have an oracle that returns the value of any data term or prior term in constant time. We parameterize the analysis by the number of variables, the number of edges, the maximum number of labels L, and the maximum degree d of the graph. A typical vision problem has a sparse graph such as a grid, meaning that the number of edges is linear in the number of variables, and d is also usually a small constant such as 4 or 8.

The running time of the for loop from line 2 to line 10 requires more care. is usually a small constant, so we can omit it from the asymptotic analysis. For the given , a naive brute-force implementation that computes must enumerate all the possible neighboring configurations , and it takes to compute , so it takes time. Therefore, the overall running time is for brute force, which remains feasible when both d and L are small constants.
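To make the cost of the brute-force enumeration concrete, the following is a minimal sketch, not the paper's implementation. The quantity it minimizes is an assumption chosen for illustration: a dead-end-elimination-style local energy gap between two labels `a` and `b` of one variable; the exact quantity used by the discriminative criterion is defined in the main paper and is not reproduced here. The function name and data layout are hypothetical. What the sketch does show is the structure of the cost: L^d neighbor configurations, each requiring O(d) work.

```python
from itertools import product


def brute_force_min_gap(unary_i, pairwise, a, b, num_labels):
    """Enumerate every labeling of the d neighbors of one variable.

    unary_i[x]         : unary cost of giving the variable label x
    pairwise[j][x][xj] : cost of the edge to neighbor j (hypothetical layout)

    Returns (minimum over neighbor configurations of the local energy
    with label b minus the local energy with label a, number of
    configurations visited).
    """
    d = len(pairwise)
    best = float("inf")
    visited = 0
    # L**d neighbor configurations in total.
    for x_nbrs in product(range(num_labels), repeat=d):
        gap = unary_i[b] - unary_i[a]
        # O(d) work to evaluate one configuration.
        for j, xj in enumerate(x_nbrs):
            gap += pairwise[j][b][xj] - pairwise[j][a][xj]
        best = min(best, gap)
        visited += 1
    return best, visited
```

With L = 3 labels and d = 2 neighbors the loop visits 9 configurations; with d = 8 it would already visit 6561, which is why the enumeration is only practical for small d and L.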

When we use the approximate method to compute the lower bound via Lemma 5 in the main paper, we need a faster way to compute Eq. 9. We can pre-compute all the terms we may use here in time globally and then query them in time without solving the min operator each time. It then takes time to compute , time to compute and to compute the sum in each iteration. Note also that once we fix a variable, it takes to update our pre-computed results. However, each variable is fixed at most once during pre-processing, so the amortized running time to update the pre-computed results is . In sum, the overall running time is for the approximate calculation.
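The pre-compute-then-query pattern described above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's Eq. 9: the cached term is assumed to be a per-edge minimum of a pairwise cost difference, in the style of dead end elimination, and all function names and the data layout are hypothetical. The sketch shows the three operations the analysis refers to: a one-time pre-computation per edge, a lookup that avoids re-solving the min operator, and a local update when a neighbor becomes fixed.

```python
def precompute_edge_minima(pairwise_j, num_labels):
    """table[a][b] = min over xj of pairwise_j[b][xj] - pairwise_j[a][xj].

    Computed once per edge, so later queries need no min operator.
    """
    labels = range(num_labels)
    return [[min(pairwise_j[b][xj] - pairwise_j[a][xj] for xj in labels)
             for b in labels] for a in labels]


def fix_neighbor(pairwise_j, fixed_label, num_labels):
    """Rebuild one edge table after the neighbor is fixed to fixed_label.

    The min disappears because the neighbor can no longer vary. Each
    variable is fixed at most once, so this update is paid once per edge.
    """
    labels = range(num_labels)
    return [[pairwise_j[b][fixed_label] - pairwise_j[a][fixed_label]
             for b in labels] for a in labels]


def decomposed_gap(unary_i, tables, a, b):
    """Sum the cached per-edge terms: one table lookup per neighbor."""
    return unary_i[b] - unary_i[a] + sum(t[a][b] for t in tables)
```

A short usage example with a Potts pairwise term on three labels and two neighbors:

```python
potts = [[0 if x == y else 1 for y in range(3)] for x in range(3)]
tables = [precompute_edge_minima(potts, 3) for _ in range(2)]
gap_before = decomposed_gap([0, 2, 5], tables, 0, 1)
tables[0] = fix_neighbor(potts, 0, 3)  # the first neighbor is now fixed to label 0
gap_after = decomposed_gap([0, 2, 5], tables, 0, 1)
```

Fixing a neighbor can only tighten the cached term, since a minimum over one label is never smaller than the minimum over all labels.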

## 8 Proof of Lemma 4

###### Lemma 4.

For the same set of decision problems for persistency, we will never increase the number of false positives by increasing .

###### Proof.

The proof is immediate. Consider any non-persistent . It is a false positive with parameter if and only if it meets our discriminative criterion, i.e., . Now for the algorithm using parameter , our discriminative criterion still holds, hence it is still a false positive for our algorithm with parameter . ∎

## 9 Proof of Theorem 10

###### Theorem 10.

Suppose we use expansion moves as the inference algorithm, with the -multiplicative bound, then we will have .

###### Proof.

Following the proof of the multiplicative bound for the expansion moves algorithm ([4], Theorem 6.1), we see that the multiplicative factor is not applied to the unary terms. In other words, .

Note that in our algorithm, the energy function of expansion moves is induced by fixing in . All the pairwise terms crossing and can be viewed as unary terms in , since one of their variables is fixed. Therefore, we have the following.

(10) | ||||

∎

## 10 Generalization of the efficient check of discriminative criterion

When we want to decide whether the given partial labeling is persistent, we can follow exactly the same idea presented in Section 3.3 of the main paper to compute the lower bound of . The only significant difference is that we need a subroutine to efficiently check for with . Persistency relaxation (PR) [42] generalizes dead end elimination (DEE) [8] from checking persistency of a single variable to an independent local minimum (ILM) partial labeling . The subproblem in PR is to decide if for , without the additional constraint that , and its authors proposed several sufficient conditions to check it efficiently. It is straightforward to enforce the additional constraint in those approaches: we just need to remove from the free variables and force it to take value in the subroutine proposed in PR. Note that those subroutines are sound, so we can still apply Lemma 5 to partial labeling and get the lower bound of . Once we have our discriminative criterion as the decision subroutine, we can follow the construction algorithm in PR (Algorithm 2) as the generalization of our proposed construction algorithm in the main paper.

## 11 More implementation details

Since we apply the proposed method to each induced binary subproblem in the expansion moves algorithm, after the first epoch of expansion moves we only check persistency for (i.e., not taking the move in the binary case) in order to get the maximum speedup. We observed that after the first epoch most variables do not change their values, hence the extra benefit from checking persistency for is very marginal.

## 12 Additional experimental results

### 12.1 Visualization results

We present the visualization results on the stereo task in Fig. 8. There is no significant visual difference between the expansion moves results and our results, even in the cases where our method has slightly higher energy. It is therefore appealing to apply our method in practice, since it has almost the same visual quality but makes inference much faster. Under a limited time budget, as in real applications (see the second column of Fig. 8), our approach generates a much better visual result than the regular expansion moves algorithm without pre-processing. In this case, regular expansion moves does not even finish its first epoch and produces a very poor disparity map.

### 12.2 Experimental results with a typical parameter setup

Our experiments suggest that the proposed method achieves good performance over a wide range of parameters. We report the experimental results in Table 5 with the following fixed parameters to avoid the expense of cross-validation: , using the uniform distribution for and checking with our efficient subroutine described in Section 3.3 of the main paper.

Even though this is a fairly conservative assumption (we use the exact same parameters for very different energy functions), we still obtain good results. We achieve a 2x-12x speedup on different datasets, with the energy increasing by at most 0.11% in the worst case. In addition, we still obtain lower energy on 4 of the 5 challenging datasets.

We also list, for reference, the performance of our method with parameters selected by the leave-one-out cross-validation procedure (shown in Table 2 in the main paper). The performance of our method is very similar whether we use fixed parameters or choose them by cross-validation. The key observation of the main paper still holds with this fixed typical parameter setup: our method achieves a significant speedup over the baseline methods with only a very minor compromise in the accuracy of the partial optimal labelings (usually a small loss in precision). We also achieve comparable or lower energy even though we compromise the accuracy of the partial optimal labelings in the pre-processing step.

Therefore, these experiments demonstrate that it’s sufficient to use the typical parameter setup of our method in practice. We can achieve very good performance without using the expensive cross validation parameter selection procedure.

Typical parameter setup (w/o cross validation)

| Dataset | Stereo | Inpainting | Denoising-sq | Denoising-ts | Optical Flow | Color-seg-n4 | Color-seg-n8 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Speedup | 2.14x | 2.10x | 11.71x | 10.61x | 8.92x | 9.31x | 8.45x |
| Energy Change | -0.04% | -0.53% | -0.03% | -0.09% | +0.11% | +0.01% | +0.05% |
| Labeled Vars | 56.77% | 47.06% | 97.39% | 96.64% | 93.74% | 90.80% | 90.43% |
| Precision | 99.71% | 99.88% | 99.95% | 99.95% | 99.50% | 99.50% | 99.76% |

Leave one out parameter selection (w/ cross validation) | |||||||