
An Algorithmic Theory of Dependent Regularizers
Part 1: Submodular Structure

Hoyt Koepke
Department of Statistics
University of Washington
Seattle, WA 98105
hoytak@stat.washington.edu

Marina Meila
Department of Statistics
University of Washington
Seattle, WA 98105
mmp@stat.washington.edu
Abstract

We present an exploration of the rich theoretical connections between several classes of regularized models, network flows, and recent results in submodular function theory. This work unifies key aspects of these problems under a common theory, leading to novel methods for working with several important models of interest in statistics, machine learning and computer vision.

In Part 1, we review the concepts of network flows and submodular function optimization theory foundational to our results. We then examine the connections between network flows and the minimum-norm algorithm from submodular optimization, extending and improving several current results. This leads to a concise representation of the structure of a large class of pairwise regularized models important in machine learning, statistics and computer vision.

In Part 2, we describe the full regularization path of a class of penalized regression problems with dependent variables that includes the graph-guided LASSO and total variation constrained models. This description also motivates a practical algorithm. This allows us to efficiently find the regularization path of the discretized version of TV penalized models. Ultimately, our new algorithms scale up to high-dimensional problems with millions of variables.

1 Introduction

High-dimensional data is a central focus of modern research in statistics and machine learning. Recent technological advances in a variety of scientific fields for gathering and generating data, matched by rapidly increasing computing power for analysis, have attracted significant research into the statistical questions surrounding structured data. Numerous models and computational techniques have emerged recently to work with these data types.

Our primary contribution is a theoretical framework enabling the efficient optimization of models for high-dimensional problems in which the predictors are believed to be dependent, correlated, or sparse. In particular, we propose a new approach for estimation of certain types of dependent predictors, in which the model incorporates a prior belief that many of the predictors take similar or identical values. This has been an active topic of research in recent years, with applications in genomics, image analysis, graphical models, and several other areas. However, efficient estimation involving structured and dependent predictors has proven to be quite challenging. Our main contribution, which includes a number of related theoretical results, is a theoretical framework and algorithmic approach that unlocks a large and particularly thorny class of these models.

After laying out the context for our research in the next section, we outline our main results as well as the needed background in section 3. Our contributions build on recent results in optimization theory and combinatorial optimization, which we describe in detail. In section 4, we present a clear theoretical connection between submodular optimization theory, network flows, and the proximal operator in regularized models such as the graph-guided LASSO. Then, in section 5, we extend the underlying submodular optimization theory to allow a weighted version of the underlying submodular problem. In the context of regularized regression, this allows the use of additional unary terms in the regularizer.

2 Overview of Statistical Models and Regularization

To begin, consider a simple regression model, in which we have observations and predictors , with

(1)

where are independent random noise vectors with ; typically, these are assumed to be i.i.d. Gaussian. In many high dimensional contexts, the estimation of may be problematic using classical methods. For example, may be much larger than , making the problem ill-posed as many possible values of map to the same response. ( and are used here instead of the more common and to be consistent with the optimization literature we connect this problem to.) Additionally, many predictors of interest may have a negligible or nearly identical effect on ; eliminating or grouping these predictors is then desirable. To handle this, a number of sophisticated approaches have been proposed to incorporate variable selection or aggregation into the statistical estimation problem.

One common and well-studied approach is to add a penalty term, or regularizer, to the log-likelihood that enforces some prior belief about the structure of (Bickel et al., 2006; Hastie et al., 2009). In this setup, estimating the predictor involves finding the minimizer of a log-likelihood term or loss function plus a regularization term :

(2)

where controls the strength of the regularization .

Many well studied and frequently used models fall into this framework. For example, if is the log-likelihood from a multivariate Gaussian distribution, one of the oldest regularization techniques is the squared ℓ2-norm penalty, which gives us the classic technique of ridge regression (Hoerl and Kennard, 1970), where

(3)

In this example, the regularization term is typically used as an effective way of dealing with an ill-conditioned inverse problem in the least squares context or as a simple way of preventing from over-fitting the model (Bishop, 2006).
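For concreteness, the following is a minimal sketch of the ridge estimator in standard notation (design matrix `X`, response `y`, regularization weight `lam`); these names are our own and not the notation adopted later in the paper.

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Ridge regression: minimize ||y - X b||^2 + lam * ||b||^2.

    Closed form: b_hat = (X^T X + lam * I)^{-1} X^T y.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Usage on synthetic data: a larger lam shrinks the estimate toward zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=50)
print(ridge_estimate(X, y, lam=1.0))
```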

In the Bayesian context, (2) often corresponds directly to finding the maximum a posteriori estimate in the classic likelihood-prior formulation, i.e.

(4)
(5)

where and are hyperparameters controlling the behavior of the prior distribution. In this formulation, the prior captures the belief encoded by the regularization term. For example, the ridge regression problem of (3) corresponds to using a standard multivariate Gaussian likelihood together with a zero-mean spherical Gaussian prior distribution over .

More recently, the Least Absolute Shrinkage and Selection Operator (LASSO) has been used frequently to promote sparsity in the resulting estimator (Tibshirani, 1996; Hastie et al., 2005, 2009). This approach uses the ℓ1-norm as the regularization term, i.e.

(6)

The LASSO problem has been analyzed in detail as a method for simultaneous estimation and variable selection, as the penalty tends to set many of the predictors exactly to zero. The consistency and theoretical properties of this model, for both estimation and variable selection, are well studied (Meinshausen and Yu, 2009; Bunea et al., 2007, 2006; Van De Geer, 2008; van de Geer, 2007; Bickel et al., 2006). Many variants of this problem have also been proposed and analyzed. These include the elastic net, in which , with ; this generalizes both ridge regression and LASSO (Zou and Hastie, 2005). Data-adaptive versions of these estimators are analyzed in (Zou, 2006; Zou and Hastie, 2005; Zou and Zhang, 2009). In addition, there are many efficient algorithms to quickly solve these problems, which is one of the reasons they are used frequently in practice (Friedman et al., 2008a, 2010, 2007; Efron et al., 2004; Beck and Teboulle, 2009).
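Computationally, the ℓ1 penalty is convenient because the one-dimensional subproblems it induces are solved by soft-thresholding. The following is a minimal sketch of cyclic coordinate descent of the kind discussed in Friedman et al. (2007, 2010), written in standard notation of our own choosing (no intercept or column standardization, and all columns assumed nonzero).

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * |.|, applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    col_norms = (X ** 2).sum(axis=0)   # assumes no all-zero columns
    r = y - X @ b                      # current residual
    for _ in range(n_sweeps):
        for j in range(p):
            r = r + X[:, j] * b[j]     # remove coordinate j from the fit
            b[j] = soft_threshold(X[:, j] @ r, lam) / col_norms[j]
            r = r - X[:, j] * b[j]     # add it back
    return b
```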

2.1 Graph Structured Dependencies

While there are other possible regularization strategies on individual parameters, of most interest to us is the incorporation of correlation structures and dependencies among the predictors into the prior distribution or regularization term. In many cases, this translates into penalizing the differences of predictors. This has recently been attracting significant algorithmic and theoretical interest as a way to effectively handle dependency structures in the data.

In our particular context, we are interested in the MAP estimate of models with a Markov random field prior. The resulting log-linear models consist of a collection of unary and pairwise terms that capture the dependency structure of the problem. In particular, the priors we are interested in enforce similarity between neighboring variables, where “neighbor” is defined according to the graph structure of the MRF prior. The type of dependency is thus determined by the structure and weights of this graph.

The simplest form of model dependency in the parameters is captured by the Fused LASSO problem (Tibshirani et al., 2005), which penalizes pairwise differences between successive terms in an ordered problem; the estimator is defined to be

(7)

where the regularization weights control the balance between the ℓ1-norm controlling variable sparsity and the sum of ordered differences that effectively penalizes changepoints. This problem has gained some recent attention in the statistics community in the context of non-parametric regression (Dümbgen and Kovac, 2009; Cho and Fryzlewicz, 2011; Davies and Meise, 2008), group sparsity (Tibshirani et al., 2005; Bleakley and Vert, 2011), and change-point detection (Bleakley and Vert, 2011). From the optimization perspective, several algorithms to find the solution to this problem have been proposed; we refer the reader to Liu et al. (2010); Ye and Xie (2011); Bach et al. (2012) or Friedman et al. (2007) for discussions of this particular problem.

The generalized problem, in which the pairwise terms are not required to be ordered, is of significant practical interest in both the statistics and machine learning communities. Here, the pairwise interactions are controlled with an arbitrary graph of weighted difference penalties:

(8)

where controls the sparsity of the individual predictors, and penalizes differences, typically between predictors known to be correlated. This problem has become known as the graph-guided LASSO (Yu, 2013).
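To fix ideas, the following is a minimal sketch of evaluating a graph-guided LASSO objective of this general form with a squared-error loss; the variable names and the exact weighting convention are our assumptions, not the paper's notation.

```python
import numpy as np

def graph_guided_lasso_objective(b, X, y, lam1, edges, edge_weights):
    """0.5*||y - Xb||^2 + lam1*||b||_1 + sum over edges of w_ij * |b_i - b_j|.

    `edges` is a list of (i, j) index pairs; `edge_weights` holds the matching w_ij >= 0.
    """
    fit = 0.5 * np.sum((y - X @ b) ** 2)
    sparsity = lam1 * np.sum(np.abs(b))
    fusion = sum(w * abs(b[i] - b[j]) for (i, j), w in zip(edges, edge_weights))
    return fit + sparsity + fusion
```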

This model – often with – arises in the context of “roughness” penalized regression in which differences between neighboring regions are penalized (Belkin et al., 2004). Besag, Green, Higdon, and Mengersen (1995) examined pairwise interactions on Markov random fields, which describe a prior distribution on the edges of an undirected graph; the proposed models were then solved with MCMC methods. Similar problems also arise in the estimation of sparse covariance matrices as outlined by Friedman, Hastie, and Tibshirani (2008b); there, structure learning is the goal and the pairwise interaction terms are not formed explicitly.

Similarly, Kovac and Smith (2011) analyze this model as an approach to nonparametric regression on a graph, although with the simpler case of . They propose an efficient active-region based procedure to solve the resulting optimization problem. In the genomics community, a version of (8) has been proposed as the Graph-Guided Fused LASSO problem by Kim, Sohn, and Xing (2009). There, the pairwise interactions were estimated from the correlation structure in the quantitative traits. Their model is essentially identical to this one, though with . The authors used a general quadratic solver to tackle the problem and showed good statistical performance in detecting genetic markers. The use of recent results in optimization, in particular proximal operator methods (Nesterov, 2007), has been proposed as a way of optimizing this model by Chen, Kim, Lin, Carbonell, and Xing (2010, 2012) and Bach, Jenatton, Mairal, and Obozinski (2012). The resulting algorithms, discovered independently, are similar to the ones we propose, although we improve upon them in several ways.

Some recent approaches focus on approximate methods with provable bounds on the solution. Nesterov proposed a smooth approximation approach which can be applied to the graph-guided LASSO in Nesterov (2007). More recently, Yu (2013) proposed a non-smooth approximation method that decomposes the summation in (8).

A very similar model to (8) is proposed in Sharma, Bondell, and Zhang (2013); there, however, the authors include an additional regularization term penalizing . In addition, they give several adaptive schemes for choosing the weights in a way that depends on the correlation structures of the design matrix. This work is noteworthy in that the authors give a detailed analysis of the asymptotic convergence and estimation rate of the model when the regularization weights are chosen adaptively. Their optimization method, however, involves a simple quadratic programming setup, which can severely limit the size of the problems for which their estimator can be used.

In the context of regularized regression with pairwise dependencies, our contribution is a theoretical treatment and several algorithms for a class of models that generalizes (8). Namely, we allow the ℓ1 penalty term to be replaced with an arbitrary convex piecewise linear function . In particular, we examine the estimator

(9)

which generalizes many of the above models. While our primary result is a representation of the theoretical structures underlying this optimization problem, our work motivates efficient and novel algorithms for working with these types of structure. The theory we develop leads to proofs of the correctness of our algorithms. In addition, we are able to present an algorithmic description that gives the entire regularization path for one version of this problem.

2.2 Total Variation Models

Total variation models are effectively an extension of the above pairwise regularization schemes to the estimation of continuous functions. In this context, we wish to estimate a function in which the total variation of – the integral of the norm of the gradient – is controlled. This model is used heavily in developing statistical models of images (Chambolle et al., 2010) and the estimation of density functions (Bardsley and Luttman, 2009), but arises in other contexts as well.

When dealing with total variation models, we work theoretically with general -measurable functions with bounded variation. Our response becomes a function , and our predictor becomes a function . However, in practice, we discretize the problem by working with the functions and only on a discrete lattice of points, as is common in these problems.
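On a discrete lattice, the total variation reduces to a sum of absolute forward differences between neighboring grid points (the anisotropic form) or a sum of Euclidean norms of the forward gradient (the isotropic form). A minimal sketch for a 2D image follows; the anisotropic/isotropic split is standard but the function names are ours.

```python
import numpy as np

def anisotropic_total_variation(u):
    """Anisotropic TV of a 2D array: sum of |forward differences| along each axis."""
    return np.abs(np.diff(u, axis=0)).sum() + np.abs(np.diff(u, axis=1)).sum()

def isotropic_total_variation(u):
    """Isotropic TV: sum over interior pixels of the Euclidean norm of the forward gradient."""
    gx = np.diff(u, axis=0)[:, :-1]
    gy = np.diff(u, axis=1)[:-1, :]
    return np.sqrt(gx ** 2 + gy ** 2).sum()
```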

Let be a compact region, and let be an -integrable continuously differentiable function defined on . In this context, the total variation of is given by

(10)

More general definitions exist when is not continuously differentiable (Ambrosio and Di Marino, 2012; Giusti, 1984); for simplicity, we assume this condition. Let represent the space of -integrable continuously differentiable functions of bounded variation. Here, bounded variation can be taken as the condition

(11)

Again, much more general definitions exist, but this suffices for our purposes.

Formally, then, the total variation problem seeks an estimate that is close to the observed function under a constraint or penalty on the total norm of the gradient of the estimate.

(12)
(13)

This model – often called the Rudin-Osher-Fatemi (ROF) model after its authors – was originally proposed in Rudin et al. (1992) as a way of removing noise from images corrupted by Gaussian white noise; however, it has gained significant attention in a variety of other areas.

In the image analysis community, it has been used often as a model for noise removal, with the solution of (10) being the estimation of the noise-free image. In addition, the value of the total variation term in the optimal solution is often used as a method of edge detection, as the locations of the non-zero gradients tend to track regions of significant change in the image.

Beyond image analysis, a number of recent models have used total variation regularization as a general way to enforce similar regions of the solution to have a common value. In this way, it is a generalization of the behavior of the Fused LASSO in one dimension; it tends to set regions of the image to constant values. As such, it is used in MRI reconstruction, electron tomography (Goris et al., 2012; Gao et al., 2010), MRI modeling (Michel et al., 2011; Keeling et al., 2012), CT reconstruction (Tian et al., 2011), and general ill-posed problems (Bardsley and Luttman, 2009).

Figure 1: An example of total variation minimization for noise removal on Leonardo da Vinci’s Mona Lisa. Panels: (a) the original image; (b) and (c) TV-minimized images at two different values of the regularization parameter, with the higher value smoothing the image less but also removing less of the noise.

The total variation regularizer tends to have a smoothing effect around regions of transition, while also setting similar regions to a constant value. As such, it has proven to be quite effective for removing noise from images and finding the boundaries of sufficiently distinct regions. The locations where the norm of the gradient is nonzero correspond to boundaries of the observed process in which the change is the greatest. A rigorous treatment of the theoretical aspects of using this as a regularization term can be found in Bellettini, Caselles, and Novaga (2002); Ring (2000) and Chan and Esedoglu (2005). A complete treatment of this topic in practice can be found in a number of surveys, including Darbon and Sigelle (2006); Allard (2007, 2008, 2009); in particular, see Caselles, Chambolle, and Novaga (2011) and Chambolle, Caselles, Cremers, Novaga, and Pock (2010). As our purpose in this work is to present a treatment of the underlying theoretical structures, we refer the reader to one of the above references for its use in practical image analysis. Still, several points are of statistical interest, which we discuss now.

Several generalizations of the traditional Rudin-Osher-Fatemi model have been proposed for different models of the noise and assumptions about the boundaries of the images. Most of these involve changes to the loss or log-likelihood term but preserve the total variation term as the regularizer. In general, we can treat as the estimation of

(14)

where is a smooth convex loss function.

One option is to use an ℓ1 loss for , namely

(15)

as a way of making the estimate more robust to outliers. This model is explored in Bect et al. (2004); Chan and Esedoglu (2005) and the survey papers above. These variants perform better under some types of noise and have also become popular (Goldfarb and Yin, 2009).

Similarly, many models for image denoising – often when the image comes from noisy sensing processes in physics, medical imaging, or astronomy – involve using a Poisson likelihood for the observations. Here, we wish to recover a density function where we observe Poisson counts in a number of cells with rate proportional to the true density of the underlying process. Total variation regularization can be employed to deal with low count rates and to recover the major structures in the data. In this case, the natural loss function, given by the negative log-likelihood, is

(16)

Research on this, including optimization techniques, can be found in Bardsley and Luttman (2009); Bardsley (2008); Sawatzky et al. (2009).
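Up to additive terms that do not depend on the estimate, the Poisson negative log-likelihood with intensity u and observed counts f is the sum (or integral) of u - f log u. The following is a minimal sketch of the resulting discretized objective, assuming this standard form of the fidelity term and a 2D anisotropic discretization of the TV penalty; the names are ours.

```python
import numpy as np

def poisson_tv_objective(u, counts, lam):
    """Discretized Poisson-noise TV objective: sum(u - counts*log(u)) + lam * TV(u).

    `u` must be strictly positive; `counts` holds the observed Poisson counts per cell.
    """
    fidelity = np.sum(u - counts * np.log(u))
    tv = np.abs(np.diff(u, axis=0)).sum() + np.abs(np.diff(u, axis=1)).sum()
    return fidelity + lam * tv
```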

Also of interest in the statistics community, Wunderli (2013) replaces the squared-norm fit penalty in (13) with the quantile regression penalty from Koenker and Bassett Jr (1978), where the loss function is replaced with

(17)

for which the author proves the existence, uniqueness, and stability of the solution.

One variation on total variation minimization is the Mumford-Shah model, used primarily for segmenting images, in which regions of discontinuity are handled explicitly (Mumford and Shah, 1989). In this model, the energy which is minimized is

(18)

where is a collection of curves and is the total length of these curves. Thus the function is allowed to be discontinuous on ; these curves are then taken to be the segmentations of the image. This model has also attracted a lot of attention in the vision community as a way of finding boundaries in the image (Chan and Vese, 2000; Pock et al., 2009; El Zehiry et al., 2007). The Bayesian interpretation of this is as a type of mixture model in which the mean function and noise variance differ between regions (Brox and Cremers, 2007). Unlike general TV regularization, however, it is typically focused exclusively on finding curves for image segmentation, rather than on estimating the original image. While it shares a number of close parallels to our problem, particularly in the graph cut optimizations used (El Zehiry et al., 2007), our method does not appear to generalize to this problem.

The optimization of the total variation problem has garnered a lot of attention as well, and an enormous number of approaches have been proposed. These largely fall into the categories of graph-based algorithms, active contour methods, and techniques based around partial differential equations. These are discussed in more depth in part 2 of this paper, where we mention them in introducing our approach to the problem.

3 Background and Outline of Main Results

The primary purpose of our work is to present a theory of the underlying structure connecting the optimization of the above models over dependent variables and recent results in combinatorial optimization, particularly submodular function minimization. Our theory connects and makes explicit a number of connections between known results in these fields, many of which we extend in practically relevant ways. The insights gained from our theory motivate a family of novel algorithms for working with these models. Furthermore, we are able to give a complete description of the structure of the regularization path; this result is also completely novel.

One primary practical contribution is the development of methods to efficiently optimize functions of the following form, which we denote as ()

()

with , is a non-negative regularization parameter, controls the regularization of dependent variables, and is a convex piecewise linear function, possibly the common penalty .

We develop a novel algorithm for () based on network flows and submodular optimization theory, and we also completely solve the regularization path over . The crux of the idea is to construct a collection of graph-based binary partitioning problems indexed by a continuous parameter. With our construction, we show that the values of this parameter at which each node changes set membership exactly give the solution to (). We develop algorithms for this that scale easily to millions of variables, and show that the above construction also unlocks the nature of the regularization path.

These functions arise in two active areas of interest to the statistics and machine learning communities. The first is in the optimization of penalized regression functions where we wish to find

()

where is a collection of observed responses and is a smooth, convex log-likelihood term. Throughout our work, we denote this problem as (). In the general case, we handle this problem through the use of proximal gradient methods, a variant of sub-gradient methods, which we describe below. The inner routine involves solving ().

Finally, () also extends to the continuous variational estimation problem of total variation minimization from section 2.2, where we wish to minimize

()

where and are defined in (10) – (13). We show in the second part of our work that this problem can be solved using the methods developed for (). Thus we not only present an efficient algorithm for (), we also present the first algorithm for finding the full regularization path of this problem.

3.1 Outline

This paper is laid out as follows. The rest of this section presents a survey of the background to the current problem, describing the current state of the research in the relevant areas and laying out the reasons it is of interest to the statistics and machine learning communities.

Section 4 lays out the basic framework connecting network flows, submodular function minimization, and the optimization of (). While some of our results were discovered independently, our theory pulls them together explicitly in a common framework, and provides more direct proofs of their correctness. Additionally, our approach motivates a new algorithm to exactly solve the resulting optimization problem.

Section 5 presents entirely novel results. We extend the basic theory of section 4 using a novel extension lemma to allow for a general size-biasing measure on the different components of the input. The theoretical results are proved for general submodular functions, extending the state-of-the-art in this area. In the context of our problem, we use these results to extend the algorithmic results of section 4. In particular, this opens a way to handle the term in () above, significantly extending the results of Mairal et al. (2011) related to section 4.

3.2 Optimization of Regularized Models

Regularized models have gained significant attention in both the statistics and machine learning communities. Intuitively, they provide a way of imposing structure on the problem solution. This facilitates the accurate and tractable estimation of high-dimensional variables or other features, such as change points, that are difficult to estimate in a standard regression context. Furthermore, these methods have been well studied both from the theoretical perspective – correct estimation is guaranteed under a number of reasonable assumptions – and the optimization perspective – efficient algorithms exist to solve them. We describe several relevant examples below.

In the general context, we consider finding the minimum of

(19)

where is a smooth convex Lipschitz loss function, is a regularization term, and controls the amount of the regularization. In statistical models, is typically given by the log-likelihood of the model. The regularization term is required to be convex but is not required to be smooth; in practice it is often a regularization penalty that enforces some sort of sparseness or structure. A common regularization term is simply the norm on , i.e.

(20)

which gives the standard LASSO problem proposed in (Tibshirani, 1996) and described in section 2.

A number of algorithms have been proposed for the optimization of this particular problem (Friedman et al., 2010; Efron et al., 2004). Most recently, a method using proximal operators, the Fast Iterative Shrinkage-Thresholding Algorithm (Beck and Teboulle, 2009), has been proposed; it achieves theoretically optimal performance, matching the lower bound on the per-iteration convergence rate for first-order optimization of general smooth, convex functions (Nesterov, 2005). The method is based on the proximal operators we discuss next, and is described in detail, along with theoretical guarantees, in Beck and Teboulle (2009). This is the method that immediately fits with our theoretical results.

3.2.1 Proximal Operators

Proximal gradient methods (Nesterov, 2007; Beck and Teboulle, 2009) are a very general approach to optimizing (19). At a high level, they can be understood as a natural extension of gradient and sub-gradient based methods designed to deal efficiently with the non-smooth component of the optimization. The motivation behind proximal methods is the observation that the objective is the sum of two convex functions, with the non-smooth part isolated in one of the terms. In this case, the proximal problem takes the form

(21)

where we abbreviate . is the Lipschitz constant of , which we assume is differentiable. We rewrite this as

(22)

The minimizing solution of (22) forms the update rule. Intuitively, at each iteration, proximal methods linearize the objective function around the current estimate of the solution, , then update this with the solution of the proximal problem. As long as (21) can be optimized efficiently, the proximal operator methods give excellent performance.
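A minimal sketch of the resulting (unaccelerated) proximal gradient iteration follows; `grad_loss` and `prox` are placeholder callables of our own for the gradient of the smooth term and the proximal operator of the regularizer, and the step size 1/L uses the Lipschitz constant of the gradient.

```python
def proximal_gradient(theta0, grad_loss, prox, L, lam, n_iter=500):
    """Iterate theta <- prox(theta - grad_loss(theta) / L, lam / L).

    grad_loss(theta): gradient of the smooth loss at theta.
    prox(z, t): proximal operator of t * R(.) evaluated at z.
    L: Lipschitz constant of grad_loss.
    """
    theta = theta0
    for _ in range(n_iter):
        theta = prox(theta - grad_loss(theta) / L, lam / L)
    return theta

# With prox = elementwise soft-thresholding, this is exactly ISTA for the LASSO;
# the accelerated (FISTA) variant adds a momentum step on top of the same update.
```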

While this technique has led to a number of algorithmic improvements for simpler regularizers, it has also opened the door to the practical use of much more complicated regularization structures. In particular, there has been significant interest in more advanced regularization structures, particularly ones involving dependencies between the variables. These types of regularization structures are the primary focus of our work.

3.2.2 Related Work

Finally, of most relevance to our work here is a series of recent papers by Francis Bach and others, namely Bach, Jenatton, Mairal, and Obozinski (2012), Mairal, Jenatton, Obozinski, and Bach (2010), and Mairal, Jenatton, Obozinski, and Bach (2011), that independently discover some of our results. In these papers, combined with the more general work of Bach, Jenatton, Mairal, and Obozinski (2011); Bach (2010b) and Bach (2010a, 2011), several of the results we present here were independently proposed, although from a much different starting point. In particular, these authors propose using parametric network flows to solve the proximal operator for and show that this corresponds to calculating the minimum norm vector of the associated submodular function. These results are discussed in more detail in section 4.

While our results were discovered independently from these, we extend them in a number of important ways. First, we propose a new algorithm to exactly solve the linear parametric flow problem; this is a core algorithm in these problems. Second, through the use of the weighting scheme proposed in section 5, we develop a way to include general convex piecewise-linear functions directly in the optimization procedure. This expands the types of regularization terms that can be handled as part of the proximal operator scheme.

3.3 Combinatorial Optimization and Network Flows

The final piece of background work we wish to present forms the last pillar on which our theory is built. Core to the theory and the algorithms is the simple problem of finding the minimum cost partitioning of a set of nodes in which relationships between these nodes are defined by a graph connecting them. Our results are built on a fundamental connection between a simple combinatorial problem – finding a minimum cost cut on a graph – and the convex optimization problems described in section 3.2.

A fundamental problem in combinatorial optimization is finding the minimum cost cut on a directed graph. This problem is defined by , where is a set of nodes and is a set of directed edges connecting two nodes in . We consistently use to denote the number of nodes and, without loss of generality, assume , i.e. the nodes are referred to using the first counting numbers. Similarly, the edges are denoted using pairs of vertices; thus . Here, denotes an edge in the graph going from to . For an undirected graph, we simply assume that if and only if , and that for all . The cost function associates a non-negative cost with each of the edges in , i.e. for .

The letters and here denote specific nodes not in the node set . For reasons that will become clear in a moment, is called the source node and is called the sink node. We treat and as valid nodes in the graph and define the cost associated with an edge from to as ; analogously, denotes the cost of an edge from to the sink . These nodes are treated specially in the optimization, however; the minimum cut problem is the problem of finding a cut in the graph separating and such that the total cost of the edges cut is minimal. Formally, we wish to find a set satisfying

(23)

where with a capital returns the set of minimizers, as there may be multiple partitions achieving the minimum cost.

The above lays out the basic definitions needed for our work with graph structures. This problem is noteworthy, however, as it can be solved easily by finding the maximal flow on the graph – one of the fundamental dualities in combinatorial optimization is the fact that the cut edges defining the minimum cost partition are the saturated edges in the maximum flow from to in the equivalent network flow problem. It is also one of the simplest practical examples of a submodular function, another critical component of our theory. We now describe these concepts.

3.3.1 Network Flows

The network flow problem is the problem of pushing as much “flow” as possible through the graph from to , where the capacity of each edge is given by the cost mapping above. The significance of this problem is the fact that it maps directly to finding the minimum cost partition in a graph (Dantzig and Fulkerson, 1955; Cormen et al., 2001; Kolmogorov and Zabih, 2004). The saturated edges of a maximizing network flow – those edges limiting any more flow from being pushed through the graph – define the optimal cut in the minimum cut partitioning. This result is one of the most practical results of combinatorial optimization, as many problems map to the partitioning problem above, and fast, simple algorithms exist for solving network flow problems.

In the network flow problem, we wish to construct a mapping that represents “flow” from to . A mapping is a valid flow if, for all nodes in , the flow going into each node is the same as the flow leaving that node. Specifically,

Definition 3.1 ((Valid) Flow).

Consider a graph as described above. Then , , is a flow on if

(24)
(25)

where for convenience, we assume that and if .

Furthermore, we say that an edge is saturated if , i.e. the flow on that edge cannot be increased.

Since we frequently talk about flows in a more informal sense, we use the term “valid flow” to reference this formal definition.

It is easy to show that the amount of flow leaving is the same as the amount of flow entering , i.e.

(26)

this total flow is what we wish to maximize in the maximum flow problem, i.e. we wish to find a maximal flow such that

(27)

The canonical min-cut-max-flow theorem (Dantzig and Fulkerson, 1955; Cormen et al., 2001) states that the set of edges saturated by all possible maximal flows defines a minimum cut in the sense of (23) above. The immediate practical consequence of this duality is that we are able to find the minimum cut partitioning quickly as fast algorithms exist for finding a maximal flow . We outline one of these below, but first we generalize the idea of a valid flow in two practically relevant ways.
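To make the duality concrete, here is a minimal sketch using an off-the-shelf solver (networkx is our choice of tooling; the paper does not prescribe one). The maximum flow value equals the minimum cut value, and the solver also reports the source-side set of an optimal cut.

```python
import networkx as nx

# A small directed graph with source "s" and sink "t"; capacities are arbitrary.
G = nx.DiGraph()
G.add_edge("s", "a", capacity=3.0)
G.add_edge("s", "b", capacity=2.0)
G.add_edge("a", "b", capacity=1.0)
G.add_edge("a", "t", capacity=2.0)
G.add_edge("b", "t", capacity=3.0)

flow_value, flow_dict = nx.maximum_flow(G, "s", "t")
cut_value, (source_side, sink_side) = nx.minimum_cut(G, "s", "t")

# The min-cut-max-flow theorem: the two optimal values coincide.
assert abs(flow_value - cut_value) < 1e-12
print(source_side - {"s"})  # nodes assigned to the source side of the optimal cut
```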

3.3.2 Preflows and Pseudoflows

Two extensions of the idea of a flow on a graph relax the equality in the definition of a flow. Relaxing this constraint is key to several algorithms, and forms some interesting connections to the theory we outline. The equality constraint is effectively replaced with an excess function that gives the excess flow at each node; if the total amount of flow into and out of a node is equal, then the excess at that node is zero. Formally,

Definition 3.2 (Preflow).

Consider a graph as in definition 3.1. Then , , is a preflow on if

(28)
(29)

With this definition, we define the function as

(30)

It is easy to see that if is a preflow, the excess is non-negative for all nodes . Maintaining a valid preflow is one of the key invariants in the common push-relabel algorithm discussed below.
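The following is a minimal sketch of the excess function and the preflow conditions, using our own encoding of a flow as a dictionary keyed by directed edges.

```python
def excess(flow, node):
    """Excess at `node`: total flow entering minus total flow leaving."""
    flow_in = sum(f for (u, v), f in flow.items() if v == node)
    flow_out = sum(f for (u, v), f in flow.items() if u == node)
    return flow_in - flow_out

def is_preflow(flow, capacity, nodes, source, sink):
    """A preflow respects capacities and has non-negative excess at every internal node."""
    if any(f < 0 or f > capacity.get(edge, 0.0) for edge, f in flow.items()):
        return False
    return all(excess(flow, v) >= 0 for v in nodes if v not in (source, sink))
```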

A pseudoflow is the weakest definition of a flow; it relaxes (29) completely, allowing there to be both excesses and deficits on the nodes of the graph. However, it does add an additional constraint, namely that all edges connected to the source and sink are saturated. Formally,

Definition 3.3 (Pseudoflow).

Consider a graph as in definition 3.1. Then , , is a pseudoflow on if

(31)
(32)
(33)

Note that now the excess function can be either positive or negative. A pseudoflow has some interesting geometric and algorithmic properties; it was described in Hochbaum (1998), and an efficient algorithm for solving network flows based on pseudoflows was presented in Hochbaum (2008). We mention it here to preview our results in section 4: the pseudoflow matches up directly with the base polytope, one of the fundamental structures in submodular function optimization.

3.3.3 Network Flow Algorithms

Finding the minimum cut or the maximum flow on a network is one of the oldest combinatorial optimization problems, and, as such, it is also one of the most well studied. It would be impossible to detail the numerous algorithms to solve the network flow problem here – Schrijver (2003) lists over 25 survey papers on different network flow algorithms. Many of these are tailored for different types of graphs (sparse vs. dense) and other variations of the problem.

Perhaps the most well-known algorithm for network flow analysis is the push-relabel method. Variants of it achieve the best known performance guarantees for a number of problems of interest. It is simple to implement; essentially, each node has a height associated with it that represents (informally) the number of edges in the shortest unsaturated path to the sink. Each node can have an excess amount of flow from the source. At each iteration, a node with excess either pushes flow along an unsaturated edge to a neighbor with a lesser height, or increments its height. Termination occurs when no node other than the source and sink retains excess – at this point, no more flow can reach the sink and the minimum cut can be read off from the saturated edges. For more details on this algorithm, see Cormen et al. (2001) or Schrijver (2003). This algorithm runs at the core of our network flow routines. It is guaranteed to run in polynomial time; in the general case, it runs in time, where is the set of vertices and is the set of edges. However, it can be improved to using dynamic tree structures (Schrijver, 2003).
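The following is a compact and deliberately unoptimized sketch of the generic push-relabel scheme just described, using arbitrary node labels and a dictionary of edge capacities; it is meant only to make the push and relabel operations concrete, not to reflect the tuned variants cited above.

```python
from collections import defaultdict

def push_relabel_max_flow(capacity, source, sink):
    """Generic push-relabel; `capacity` maps (u, v) edge tuples to capacities."""
    nodes = {u for u, _ in capacity} | {v for _, v in capacity} | {source, sink}
    n = len(nodes)
    cap = defaultdict(float)
    for (u, v), c in capacity.items():
        cap[(u, v)] += c
    flow = defaultdict(float)              # antisymmetric: flow[(v, u)] == -flow[(u, v)]
    height = {v: 0 for v in nodes}
    excess = {v: 0.0 for v in nodes}
    height[source] = n

    def residual(u, v):
        return cap[(u, v)] - flow[(u, v)]

    def push(u, v, amount):
        flow[(u, v)] += amount
        flow[(v, u)] -= amount
        excess[u] -= amount
        excess[v] += amount

    # Initialize by saturating every edge leaving the source.
    for v in nodes:
        if cap[(source, v)] > 0:
            push(source, v, cap[(source, v)])

    active = [v for v in nodes if v not in (source, sink) and excess[v] > 0]
    while active:
        u = active[-1]
        if excess[u] <= 0:
            active.pop()
            continue
        pushed = False
        for v in nodes:
            # Push along an admissible edge: unsaturated and going one level downhill.
            if residual(u, v) > 0 and height[u] == height[v] + 1:
                push(u, v, min(excess[u], residual(u, v)))
                if v not in (source, sink) and v not in active:
                    active.append(v)
                pushed = True
                break
        if not pushed:
            # Relabel: lift u just above its lowest residual neighbor.
            height[u] = 1 + min(height[v] for v in nodes if residual(u, v) > 0)
    return sum(flow[(source, v)] for v in nodes)

# Usage on the same toy capacities as before; the maximum flow value is 5.0.
caps = {("s", "a"): 3.0, ("s", "b"): 2.0, ("a", "b"): 1.0, ("a", "t"): 2.0, ("b", "t"): 3.0}
print(push_relabel_max_flow(caps, "s", "t"))
```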

3.3.4 The Parametric Flow Problem

One variation of the regular network flow problem is the parametric flow problem (Gallo et al., 1989). The general parametric flow problem is a simple modification of the maximum flow problem given above, except that now is replaced with a monotonic non-decreasing function ,

(34)

Gallo et al. (1989) showed that this problem could be solved for a fixed sequence in time proportional to the time of a single run of the network flow problem by exploiting a nestedness property of the solutions. Namely, as the parameter increases, the minimum cut of the optimal solution moves closer on the graph to the source. More formally, if , then

(35)

where is an optimal cut at in the sense of (23).

If is a linear function with positive slope, we call this problem the linear parametric flow problem. In sections 4 and 5, we develop an exact algorithm for this problem that does not require a predetermined sequence of parameter values. This method is one of the key routines in our algorithms for the general optimization problem.
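To illustrate the nestedness property, the following sketch re-solves a small min-cut problem at several parameter values by brute force; the specific linear parametrization of the terminal capacities is our own choice for illustration and is not the exact algorithm developed in sections 4 and 5.

```python
import networkx as nx

def source_side_cut(unary, edges, lam):
    """Min cut of a graph whose terminal capacities vary with the parameter lam.

    Node i is attached to the source with capacity max(unary[i] - lam, 0) and to
    the sink with capacity max(lam - unary[i], 0); `edges` holds symmetric pairwise
    capacities {(i, j): w_ij}.
    """
    G = nx.DiGraph()
    for i, a in enumerate(unary):
        G.add_edge("s", i, capacity=max(a - lam, 0.0))
        G.add_edge(i, "t", capacity=max(lam - a, 0.0))
    for (i, j), w in edges.items():
        G.add_edge(i, j, capacity=w)
        G.add_edge(j, i, capacity=w)
    _, (source_side, _) = nx.minimum_cut(G, "s", "t")
    return source_side - {"s"}

unary = [0.2, 0.9, 1.5, 2.1]
edges = {(0, 1): 0.5, (1, 2): 0.5, (2, 3): 0.5}
cuts = [source_side_cut(unary, edges, lam) for lam in (0.0, 0.5, 1.0, 1.5, 2.5)]
# Nestedness: as lam increases, the source-side sets shrink monotonically.
assert all(later <= earlier for earlier, later in zip(cuts, cuts[1:]))
```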

3.3.5 Connections to Statistical Problems

One of the recent applications of graph partitioning is in finding the minimum energy state of a binary Markov random field. In particular, if we can model the pairwise interactions over binary variables on the graph using unary and pairwise potential functions and , then the minimum energy solution can be found using the minimum cut on a specially formulated graph, provided that for all . This application of network flow solvers has had substantial impact in the computer vision community, where pairwise potential functions satisfying this condition are quite useful.

We discuss the details of this application in section 4, where we start with the theory behind these connections as the first building block of our other results. Ultimately, we show that network flow algorithms can be used to solve not only this problem, but the much more general optimization problems described in section 3.2 as well. Through these connections, we hope to bring the power of network flow solvers into more common use in the statistical community.

3.4 Submodular Functions

The problem of finding a minimal partition in a graph is a special case of a much larger class of combinatorial optimization problems called submodular function minimization. Like the problem of finding a minimal-cost partitioning of a graph given in (23), this optimization problem involves finding the minimizing set over a submodular function , where denotes the collection of all subsets of a ground set . Given this, is submodular if, for all sets ,

(36)

This condition is seen as the discrete analogue of convexity. It is sufficient to guarantee that a minimizing set can be found in polynomial time.

An alternative definition of submodularity involves the idea of diminishing returns (Fujishige, 2005). Specifically, is submodular if and only if for all , and for all ,

(37)

This property is best illustrated with a simple example. Let be subsets of a larger set , and define

(38)

so measures the coverage of the set . In this context, it is easy to see that satisfies (37).
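A small numerical check of the diminishing-returns characterization (37) for this coverage example follows; the particular sets below are arbitrary toy values of our own.

```python
from itertools import combinations

# The ground set indexes a family of subsets S_1, ..., S_n of a small universe.
S = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}, 4: {"a", "e"}}

def coverage(A):
    """F(A) = number of universe elements covered by the sets indexed by A."""
    return len(set().union(*(S[i] for i in A))) if A else 0

# Diminishing returns: for A subset of B and k outside B,
# F(A + k) - F(A) >= F(B + k) - F(B).
V = set(S)
for r in range(len(V) + 1):
    for B in map(set, combinations(V, r)):
        for s in range(len(B) + 1):
            for A in map(set, combinations(B, s)):
                for k in V - B:
                    assert (coverage(A | {k}) - coverage(A)
                            >= coverage(B | {k}) - coverage(B))
```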

is an example of a monotone submodular function, as adding new elements to can only increase the value of . However, many practical examples do not fall into this category. In particular, the graph cut problem given in equation (23) above is non-monotone submodular:

(39)

We examine this particular example in more detail in section 4, where we prove is indeed submodular.

Submodular function optimization has gained significant attention lately in the machine learning community, as many other practical problems involving sets can be phrased as submodular optimization problems. It has been used for numerous applications in language modeling (Lin and Bilmes, 2010, 2012), clustering (Narasimhan, Jojic, and Bilmes, 2005; Narasimhan and Bilmes, 2007), computer vision (Jegelka and Bilmes, 2011, 2010), and many other domains. This field is quite active, both in terms of theory and algorithms, and we contribute some novel results to both areas in section 5.

The primary focus of our work has been developing a further connection between these graph problems and submodular optimization theory. Our approach, however, is the reverse of much of the previous work. Ultimately, we attempted to map the network flow problems back to the submodular optimization problems. Surprisingly, this actually opened the door to several new theoretical results for continuous optimization, and, in particular, to efficient solutions of the optimization problems given in section 2.

3.4.1 Some Formalities

The theory surrounding submodular optimization, and combinatorial optimization in general, is quite deep. Many of the results underlying submodular optimization require a fairly substantial tour of the theory of the underlying structures; good coverage of these results is found in (Fujishige, 2005) and (Schrijver, 2003). Most of these results are not immediately relevant to our work, so we leave them to the interested reader. However, several additional results are needed for some of the proofs we use later.

As with the discussion of the minimum cut problem, we assume that ; our notation intentionally matches that of the minimum cut problem defined above and is consistent throughout our work. We refer to as the ground set. Formally, can be restricted to map from a collection of subsets of , which we refer to consistently as , with . In the context of submodular functions, must be closed under union and intersection and include as an element (Fujishige, 2005). In general, and except for some of our proofs in section 5, can be thought of as .

3.4.2 Submodular Function Optimization

The first polynomial time algorithm for submodular function optimization was described in (Grötschel et al., 1993); it used the ellipsoid method from linear programming (Chvátal, 1983). While sufficient to prove that the problem can be solved in polynomial time, it was impractical to use on any real problems. The first strongly polynomial time algorithms – polynomial in a sense that does not depend on the values in the function – were proposed independently in Schrijver (2000) and Iwata, Fleischer, and Fujishige (2001). The algorithm proposed by Schrijver runs in , where refers to the complexity of evaluating the function. The latter algorithm is , which may be better or worse depending on . The weakly polynomial version of this algorithm runs in , where is the difference between maximum and minimum function values. Research in this area, however, is ongoing – a strongly polynomial algorithm that runs in has been proposed by Orlin (2009).

In practice, the minimum norm algorithm – also called the Fujishige-Wolfe Algorithm – is generally much faster, although it does not have a theoretical upper bound on the running time (Fujishige, 2005). We discuss this algorithm in detail, as it forms the basis of our work. However, there are clear cases that occur in practice where this algorithm does not seem to improve upon the more complicated deterministic ones – in Jegelka, Lin, and Bilmes (2011), a running time of was reported. Thus the quest for practical algorithms for this problem continues; it is an active area of research.

Additionally, several other methods for practical optimization have been proposed for general submodular optimization or for special cases that are common in practice. Stobbe and Krause (2010) proposed an efficient method for submodular functions that can be represented as a decomposable sum of smaller submodular functions given by , where is convex. In this particular case, the function can be mapped to a form that permits the use of nice numerical optimization techniques; however, many submodular functions cannot be minimized using this technique, and it can also be slow (Jegelka et al., 2011). Along with analyzing the deficiencies of existing methods, Jegelka, Lin, and Bilmes (2011) propose a powerful approach that relies on approximating the submodular functions with a sequence of graphs that permit efficient optimization. This method is quite efficient on a number of practical problems, likely because the graph naturally approximates the underlying structure present in many real-world problems.

Most recently, in Iyer, Jegelka, and Bilmes (2013), another practical optimization method is proposed; at its core is a framework for both submodular minimization and maximization based on a notion of discrete sub-gradients and super-gradients of the function. While not having a theoretical upper bound itself, it performs efficiently in practice. This method is also noteworthy in that it can constrain the solution space in which other exact solvers operate, providing substantial speedups.

3.4.3 Geometrical Structures and the Minimum Norm Algorithm

A number of geometrical structures underpin the theory of submodular optimization. In particular, an associated polymatroid is defined as the set of points in -dimensional Euclidean space in which the sum over each subset of coordinates is constrained by the function value of the associated set.

For notational convenience, for a vector and set , define

(40)

In this way, forms a type of unnormalized set measure.

Now the polymatroid associated with is defined as

(41)

is fundamental to the theory behind submodular function optimization.

In our theory, we work primarily with the base of the polymatroid , denoted by . This is the extreme -dimensional face of ; it is defined as

(42)

In the case where the domain , is a convex, compact polyhedral set. Thus it is often referred to as the base polytope of .

The base is particularly important for our work, as one of the key results from submodular function optimization is the minimum norm algorithm, which states effectively that sets minimizing are given by the signs of the coordinates of the point in closest to the origin. This surprising result, while simple to state, takes a fair amount of deep theory to prove for general ; we state it here:

Theorem 3.4 ((Fujishige, 2005), Lemma 7.4.).

For submodular function defined on , let

(43)

and let and . Then both and minimize over all subsets . Furthermore, for all such that , .

A central aspect of our theory relies on this result. In general, it is one way of working with the geometric structure of the problem. It turns out that in the case of graph partitioning, takes on a particularly nice form. This allows us to very quickly solve the minimum norm problem here. We show that the resulting minimum norm vector also gives us the full solution path over a weighting by the cardinality of the minimizing set. While this basic result was independently discovered in Mairal, Jenatton, Obozinski, and Bach (2011), our approach opens several doors for theoretical and algorithmic improvements, which we outline in the subsequent sections.
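To make theorem 3.4 concrete, the following sketch computes the minimum norm point of the base polytope for a three-element toy example by solving the projection as a small quadratic program, then compares the thresholded sets against brute-force minimization; the function, weights, and the use of scipy are our own choices for illustration.

```python
from itertools import chain, combinations

import numpy as np
from scipy.optimize import minimize

# A tiny submodular function: a modular part plus a graph-cut part (arbitrary toy values).
a = np.array([-1.0, 0.5, 0.3])
w = {(0, 1): 0.4, (1, 2): 0.2}
V = [0, 1, 2]

def F(A):
    A = set(A)
    cut = sum(wij for (i, j), wij in w.items() if (i in A) != (j in A))
    return sum(a[i] for i in A) + cut

subsets = list(chain.from_iterable(combinations(V, r) for r in range(len(V) + 1)))

# Min-norm point of the base polytope B(F) = {y : y(A) <= F(A) for all A, y(V) = F(V)},
# found here by brute force as a small quadratic program.
cons = [{"type": "ineq", "fun": lambda y, A=A: F(A) - y[list(A)].sum()}
        for A in subsets if 0 < len(A) < len(V)]
cons.append({"type": "eq", "fun": lambda y: F(V) - y.sum()})
y_init = np.array([F({0}), F({0, 1}) - F({0}), F({0, 1, 2}) - F({0, 1})])  # a greedy vertex of B(F)
y_star = minimize(lambda y: y @ y, y_init, constraints=cons, method="SLSQP").x

# Theorem 3.4: thresholding the min-norm point at zero recovers a minimizer of F.
A_min_norm = {i for i in V if y_star[i] < -1e-6}
A_brute = min(subsets, key=F)
print(A_min_norm, set(A_brute), F(A_brute))  # both sets should agree
```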

4 The Combinatorial Structure of Dependent Problems

The foundational aspect of our work is the result established in this section, namely an exploration of network flow minimization problems in terms of their geometric representation on the base polytope of a corresponding submodular problem. This representation is not new; it is explored in some depth as an illustrative example in (Fujishige, 2005) and connected to the minimum norm problems in Mairal et al. (2011). Our contribution, however minor, is based on a simple transformation of the original problem that yields a particularly intuitive geometric form. The fruit of this transformation is a collection of novel results of theoretic and algorithmic interest; in particular, we are able to exactly find the optimal over the problem

(44)

Recall that this problem was discussed in depth in section 3.2; here, is given, and and the weights control the regularization.

In this section, we lay the theoretical foundation of this work, noting connections to other related or previously known results. The contribution at the end is a strongly polynomial algorithm for the solution of a particular parametric flow problem; this algorithm follows naturally from this representation. In the next section, we extend the theory to naturally allow a convex piecewise-linear function to be included in (44) to match ().

Before presenting the results for the real-valued optimization of (44), we must first present a number of theoretical results using the optimization over sets . We begin with the simplest version of this optimization – finding the minimizing partition of a graph – and extend this result to general submodular functions later.

4.1 Basic Equivalences

We proceed by defining and establishing equivalences between three basic forms of the minimum cut problem on a graph. Recall from section 3.3 that the minimum cut problem is the problem of finding a set satisfying

(45)

As we mentioned earlier, several other important problems can be reduced to this form; in particular, finding the lowest energy state of a binary Markov random field, when the pairwise potential functions satisfy the submodularity condition, is equivalent to this problem. Our task now is to make this explicit.

We here show equivalences between three versions of the problem. The first is , which gives the standard energy minimization formulation, i.e. finding the MAP estimate of

(46)

which is equivalent to finding

(47)

gives the formulation as a quadratic binary minimization problem; here, the problem is to find that minimizes for an matrix . This provides a convenient form that simplifies much of the notation in the later proofs. Finally, we show it is equivalent to the classic network flow formulation, denoted . This states the original energy minimization problem as the minimum -cut on a specially formulated graph structure. The equivalence of these representations is well known (Kolmogorov and Zabih, 2004) and widely used, particularly in computer vision applications. We here present a different parametrization of the problem which anticipates the rest of our results.

The other unique aspect of our problem formulation is the use of a size biasing term; specifically, we add a term to the optimization problem, where can be positive or negative. This term acts similarly to a regularization term in how it influences the optimization, but to think of it this way would lead to confusion as the true purpose of this formulation is revealed later in this section – ultimately, we show equivalence between the values of at which the set membership of a node flips and the optimal values of the continuous optimization problem of ().

Theorem 4.1.

Let be an undirected graph, where is the set of vertices (assume ) and is the set of edges. Without loss of generality, assume that . Define , , and as the sets of optimizing solutions to the following three problems, respectively:

Energy Minimization: .

Given an energy function defined for each vertex and a pairwise energy function defined for each edge , with , let

(48)

and let

(49)
Quadratic Binary Formulation: .

Given and as in , define the matrix as:

(50)
(51)

Suppose for , and let

(52)

and let

(53)
Minimum Cut Formulation: .

Let be an augmented undirected graph with , where and represent source and sink vertices, respectively, and . Define capacities , on the edges as:

(54)

where

(55)

Then the set of minimum cut solutions is given by

(56)

Then , , and are equivalent in the sense that any minimizer of one problem is also a minimizer of the others; specifically,

(57)
Proof.

A reformulation of known results (Kolmogorov and Zabih, 2004), but given in section A for convenience. ∎

The primary consequence of this theorem is that the energy minimization problem can be solved efficiently and exactly, owing to several types of excellent network flow solvers that make solving problems with millions of nodes routine (Boykov and Kolmogorov, 2004; Cormen et al., 2001; Schrijver, 2003). Because of this, numerous applications for graph cuts have emerged in recent years in computer vision and machine learning. Our purpose, in part, is to expand the types of problems that can be handled with network flow solvers, and problems of interest in statistics in particular.

In our work, we alternate frequently between the above representations. For constructing problems, the energy minimization formulation is nicely behaved. In the theory, we typically find it easiest to work with the quadratic binary problem formulation due to the algebraic simplicity of working in that form. Again, however, each of these is equivalent; when it does not matter which form we use, we refer to the problem and solution set as and , respectively.

4.1.1 Connections to Arbitrary Network Flow Problems

While theorem 4.1 lays out the equivalence between and and a specific form of network flow problem , for completeness we show that any network flow problem can be translated into the form and thus and . The two distinguishing aspects of are the facts that each node is connected to either the source or the sink, and that all the edges are symmetric, i.e. for all . It is thus sufficient to show that an arbitrary flow problem can be translated to this form. For conciseness, assume that ; the results can be adapted for other easily.

Theorem 4.2.

Any minimum cut problem on an arbitrary, possibly directed graph can be formulated as a quadratic binary problem of the form as follows:

  1. For all edges such that , add a path with capacity . That edge is now undirected in the sense that both directions have the same capacity, and the edges in the minimum cut are unchanged, as these paths will simply be eliminated by flow along that path.

  2. Set for and for .

  3. Given , set

Using these steps, any minimum cut problem can be translated to in the sense that the set of minimizing solutions is identical.

Proof.

Network flows are additive in the sense that increasing or decreasing the capacity of each edge in any path from to by a constant amount does not change the set of minimizing solutions of the resulting problem, even if new edges are added (Cormen et al., 2001; Schrijver, 2003). Thus step 1 is valid. The rest follows from simple algebra. ∎

4.2 Network Flows and Submodular Optimization

Recall from section 3.4 that the minimum cut problem is a subclass of the more general class of submodular optimization problems. In the context of , it is easy to state a direct proof of this fact, additionally showing that here the submodularity of the pairwise terms is also necessary for general submodularity.

Theorem 4.3.

The minimization problem can be expressed as minimization of a function , with

(58)

Then is submodular if and only if .

Proof.

One immediate proof of the if part follows from the fact that can be expressed as the sum of submodular pairwise potential terms, and the sum of pairwise submodular functions is also submodular (Fujishige, 2005). The direct proof, including both directions, is a simple reformulation of known results; see (Kolmogorov and Zabih, 2004). It is given in section A for convenience. ∎

4.2.1 Geometric Structures

We are now ready to present the theory that explicitly describes the geometry of , which extends to both and , in the context of submodular function optimization. This theory is not new; several abstract aspects of it have been thoroughly explored by Schrijver (2003) and Fujishige (2005). In our case, however, the exact form of the problem presented in theorem 4.1 was carefully chosen to yield nice properties when this connection is made explicit. From these, a number of desirable properties follow immediately.

The rest of this section is arranged as follows. First, we show that the base polytope is a reduction of all pseudoflows (see definition 3.3) on the form of the cut problem from theorem 4.1. , described in section 3.4, has special characteristics for our purposes, as the minimum norm algorithm described in section 3.4 provides a convenient theoretical tool to examine the structure of . Our key result is to show that the minimum norm vector – the -projection of the origin onto – depends on only through a simple, constant shift. Thus this vector allows us to immediately compute the minimum cut directly for any value of this shift, and hence the minimum cut solution for any such value as well.

Theorem 4.4 (Structure of ).

Let be defined in theorem 4.3 (58), and let

(59)

Let , , be defined as follows:

(60)

and denote . Then the base of the polymatroid associated with , , is given by

(61)

and the full polymatroid polytope is given by

(62)
Proof.

Proceeds with straightforward albeit tedious algebra. Proved in section A. ∎

In light of this, the minimum norm problem on the graph structure is as follows. The immediate corollary to the min-norm theorem is that yields the optimum cut of the corresponding network flow problem, :

Theorem 4.5 (Minimum Norm Formulation of ).

Let and be given by equations (59) and (60), respectively. Then the min-norm problem associated with is defined by

(63)

Then any optimal cut solving , as given in theorem 4.1, satisfies:

(64)
Proof.

Follows immediately from theorems 4.1 and 4.3 characterizing the cut problem as a submodular function optimization problem, theorem 4.4 describing the structure of this problem, and theorem 3.4 to characterize the solution. ∎

The optimal in the above formulation has some surprising consequences that motivate the rest of our results. In particular, we show that the optimal in equation (63) is independent of ; this is the key observation that allows us to find the entire regularization path over . Formally, this result is given in theorem 4.9. However, we need other results first.

4.2.2 Connection to Flows

The representation in terms of is significant partly as the values of effectively form a pseudoflow in the sense of Hochbaum (2008) (see section 3.3.2). Recall that a pseudoflow extends the concept of a flow by allowing all the nodes to have both excesses and deficits. In addition, a pseudoflow assumes that all edges from the source to nodes in the graph are saturated, possibly creating excesses at these nodes, and all edges connected to the sink are similarly saturated, possibly creating deficits at these nodes.

Theorem 4.6.

Consider the problem . For any , with , let represent the flow on each edge , with