Snake: a Stochastic Proximal Gradient Algorithm for Regularized Problems over Large Graphs


Adil Salim, Pascal Bianchi and Walid Hachem. A. Salim and P. Bianchi are with the LTCI, Télécom ParisTech, Université Paris-Saclay, 75013, Paris, France. W. Hachem is with the CNRS / LIGM (UMR 8049), Université Paris-Est Marne-la-Vallée, France. This work was supported by the Agence Nationale pour la Recherche, France (ODISSEE project, ANR-13-ASTR-0030) and by the Labex Digiteo-DigiCosme (OPALE project), Université Paris-Saclay. The authors are grateful to TeraLab DataScience for their material support.
July 20, 2019

A regularized optimization problem over a large unstructured graph is studied, where the regularization term is tied to the graph geometry. Typical regularization examples include the total variation and the Laplacian regularizations over the graph. When applying the proximal gradient algorithm to solve this problem, there exist quite affordable methods to implement the proximity operator (backward step) in the special case where the graph is a simple path without loops. In this paper, an algorithm, referred to as “Snake”, is proposed to solve such regularized problems over general graphs, by taking advantage of these fast methods. The algorithm consists in properly selecting random simple paths in the graph and performing the proximal gradient algorithm over these simple paths. This algorithm is an instance of a new general stochastic proximal gradient algorithm, whose convergence is proven. Applications to trend filtering and graph inpainting are provided among others. Numerical experiments are conducted over large graphs.

I Introduction

Many applications in the fields of machine learning [1, 2, 3], signal and image restoration [4, 5, 6], or trend filtering [7, 8, 9, 10, 11, 12] require the solution of the following optimization problem. On an undirected graph with no self loops, where represents a set of nodes () and is the set of edges, find


where is a convex and differentiable function on representing a data fitting term, and the function represents a regularization term of the form

where is a family of convex and symmetric functions. The regularization term will be called a -regularization in the sequel. These -regularizations often promote the sparsity or the smoothness of the solution. For instance, when where is a vector of positive weights, the function coincides with the weighted Total Variation (TV) norm. This kind of regularization is often used in problems over a graph that are intended to recover a piecewise constant signal across adjacent nodes [9, 8, 10, 11, 12, 13, 14, 15]. Another example is the Laplacian regularization or its normalized version obtained by rescaling and by the degrees of each node in respectively. Laplacian regularization tends to smooth the solution in accordance with the graph geometry [1, 2].
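As a concrete illustration (not from the paper's code; the weighted-edge-list representation and function names are ours), the two regularizers just mentioned can be evaluated on a small weighted graph as follows:

```python
# Sketch of the two graph regularizers discussed above.  The graph is a
# list of weighted edges (u, v, w); the node signal x is a dict mapping
# each node to its value.  Names are illustrative.

def tv_norm(x, edges):
    """Weighted total variation: sum over edges of w * |x[u] - x[v]|."""
    return sum(w * abs(x[u] - x[v]) for u, v, w in edges)

def laplacian_reg(x, edges):
    """Weighted Laplacian regularization: sum over edges of
    w * (x[u] - x[v])**2."""
    return sum(w * (x[u] - x[v]) ** 2 for u, v, w in edges)
```

On a path graph 0-1-2 with weights 1 and 2 and signal (0, 1, 3), the TV norm is 1·1 + 2·2 and the Laplacian regularization is 1·1 + 2·4.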

The Forward-Backward (or proximal gradient) algorithm is one of the most popular approaches towards solving Problem (1). This algorithm produces the sequence of iterates


where is a fixed step, and where

is the well-known proximity operator applied to the proper, lower semicontinuous (lsc), and convex function (here is the standard Euclidean norm). When satisfies a smoothness assumption, and when is small enough, it is indeed well-known that the sequence converges to a minimizer of (1), assuming this minimizer exists.
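The Forward-Backward iteration can be sketched on a scalar toy problem, with the soft-thresholding prox of the absolute value standing in for the graph regularizers discussed later (all names and the toy instance are illustrative, not from the paper):

```python
def soft_threshold(v, t):
    """Proximity operator of t * |.| (1-D soft thresholding)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def proximal_gradient(grad_f, prox_g, x0, gamma, n_iters=500):
    """Forward-Backward: x_{n+1} = prox_{gamma g}(x_n - gamma grad_f(x_n))."""
    x = x0
    for _ in range(n_iters):
        x = prox_g(x - gamma * grad_f(x), gamma)
    return x

# Toy instance: f(x) = 0.5*(x - 3)^2, g(x) = |x|.
# The minimizer is the soft threshold of 3 at level 1, i.e. 2.
a, lam = 3.0, 1.0
x_star = proximal_gradient(lambda x: x - a,
                           lambda v, g: soft_threshold(v, g * lam),
                           x0=0.0, gamma=0.5)
```

With a fixed step below 2/L (here L = 1), the iterates converge linearly to the minimizer.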

Implementing the proximal gradient algorithm requires the computation of the proximity operator applied to at each iteration. When is large, this computation is in general affordable only when the graph exhibits a simple structure. For instance, when is the TV norm, the so-called taut-string algorithm is an efficient algorithm for computing the proximity operator when the graph is one-dimensional (1D) [16] (see Figure 1) or when it is a two-dimensional (2D) regular grid [13]. Similar observations can be made for the Laplacian regularization [17], where, e.g., the discrete cosine transform can be implemented. When the graph is large and unstructured, these algorithms cannot be used, and the computation of the proximity operator is more difficult ([8, 18]).

This problem is addressed in this paper. Towards obtaining a simple algorithm, we first express the functions and as expectations of functions defined on random walks in the graph, paving the way for a randomized version of the proximal gradient algorithm. Stochastic online algorithms in the spirit of this algorithm are often considered as simple and reliable procedures for solving high dimensional machine learning problems, including in situations where the randomness is not inherent to these problems [19, 20]. One specificity of the algorithm developed here is that it reconciles two requirements: on the one hand, the random versions of should be defined on simple paths, i.e., on walks without loops (see Figure 1), so as to benefit from the power of the existing fast algorithms for computing the proximity operator. Owing to the existence of a procedure for selecting these simple paths, we refer to our algorithm as the “Snake” algorithm. On the other hand, the expectations of the functions handled by the optimization algorithm coincide with and respectively (up to a multiplicative constant), in such a way that the algorithm does not introduce any bias in the estimates.

Fig. 1: Left: General graph on which is colored the simple path 3-1-0-6-7. Right: 1D-graph.

There often exist efficient methods to compute the proximity operator of a -regularization over 1D-graphs. The Snake algorithm randomly selects simple paths in a general graph in order to apply these efficient 1D methods over a general graph.

Actually, the Snake algorithm is an instance of a new general stochastic approximation algorithm that we develop in this paper. In some aspects, this general stochastic approximation algorithm is itself a generalization of the random Forward-Backward algorithm studied in [21].

Before presenting our approach, we provide an overview of the literature dealing with our problem. First consider the case where coincides with the TV norm. As said above, fast methods exist when the graph has a simple structure. We refer the reader to [13] for an overview of iterative solvers of Problem (1) in these cases. In [22], the author introduces a dynamic programming method to compute the proximity operator on a 1D-graph with a complexity of order . Still in the 1D case, Condat [16] recently revisited an algorithm due to Mammen and Van De Geer [23], referred to as the taut-string algorithm. The complexity of this algorithm is in the worst-case scenario, and in the most realistic cases. The taut-string algorithm is linked to a total variation regularized problem in [24]. This algorithm is generalized to 2D-grids, weighted TV norms and TV norms by Barbero and Sra in [13]. To generalize to 2D-grids, the TV regularization can be written as a sum of two terms on which one can apply 1D methods, according to [25] and [26]. Over general graphs, there is no immediate way to generalize the taut-string method. The problem of computing the TV-proximity operator over a general graph is addressed in [8].

The authors of [8] suggest solving the problem using a projected Newton algorithm applied to the dual problem. It is observed that, empirically, this method performs better than other concurrent approaches. As a matter of fact, this statement holds when the graph has a moderate size. As far as large graphs are concerned, the iteration complexity of the projected Newton method can be a bottleneck. To address this problem, the authors of [14] and [3] propose to solve the problem distributively over the nodes using the Alternating Direction Method of Multipliers (ADMM).

In [12], the authors propose to compute a decomposition of the graph into 1D-graphs and then solve Problem (1) by means of the TV-proximity operators over these 1D-graphs. Although the decomposition of the graph is fast in many applications, the algorithm of [12] relies on an offline decomposition of the whole graph that needs a global knowledge of the graph topology. The Snake algorithm obtains this decomposition online. In [11], the authors propose a working set strategy to compute the TV-proximity operator. At each iteration, the graph is cut into two well-chosen subgraphs and a reduced version of Problem (1) is deduced from this cut. The reduced problem is then solved efficiently. This method has shown speed-ups when is an image (i.e., a two-dimensional grid). Although the decomposition of the graph is not done during the preprocessing time, the algorithm of [11] still needs a global knowledge of the graph topology during the iterations. On the contrary, the Snake algorithm only needs a local knowledge. Finally, in [9], the authors propose to replace the computation of the TV-proximity operator over the graph by the computation of the TV-proximity operator over a well-chosen 1D-subgraph of . This produces an approximation of the solution, whereas the Snake algorithm is proven to converge to the exact solution.

In the case where is the Laplacian regularization, the computation of the proximity operator of reduces to the resolution of a linear system, where is the Laplacian matrix of the graph and the identity matrix. On a 1D-graph, this resolution can be done efficiently; it relies on an explicit diagonalization of ([17]) by means of the discrete cosine transform, which takes operations. Over general graphs, the problem of computing the proximity operator of the Laplacian regularization is introduced in [2]. There exist fast algorithms to solve it due to Spielman and Teng [27]. They are based on recursively preconditioning the conjugate gradient method using graph theoretical results [18]. Nevertheless, the preconditioning phase may be demanding over very large graphs. Compared to [18], our online method Snake requires no preprocessing step.

II Outline of the approach and paper organization

The starting point of our approach is a new stochastic optimization algorithm that has its own interest. This algorithm will be presented succinctly here, and more rigorously in Sec. III below. Given an integer , let be a random vector where the are valued in some measurable space. Consider the problem


where the are convex and differentiable, and the are convex. Given , define the operator . Given a sequence of independent copies of , and a sequence of positive steps , we consider the algorithm



and where stands for the composition of functions: . In other words, an iteration of this algorithm consists in the composition of random proximal gradient iterations. The case where was treated in [21].
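A minimal sketch of this composed iteration, under our own toy instantiation (a single component, a quadratic data term and an ℓ1-type regularizer; the step-size schedule and tail averaging are illustrative choices, not prescribed by the paper):

```python
import random

def stochastic_prox_grad(grads, proxs, sample_xi, n_iters, rng):
    """Sketch of iteration (4): each outer step draws one sample
    xi = (xi_1, ..., xi_M) and composes one proximal gradient step per
    component i.  Returns the average of the tail iterates."""
    x, avg, count = 0.0, 0.0, 0
    for k in range(n_iters):
        gamma = (k + 1) ** -0.7          # decreasing steps (cf. Ass. 2)
        xi = sample_xi(rng)
        for grad_i, prox_i, xi_i in zip(grads, proxs, xi):
            x = prox_i(x - gamma * grad_i(x, xi_i), gamma, xi_i)
        if k >= n_iters // 2:            # tail averaging to reduce noise
            avg += x
            count += 1
    return avg / count

def soft(v, t):
    """Prox of t * |.|."""
    return max(v - t, 0.0) if v >= 0 else min(v + t, 0.0)

# Toy check with M = 1: f(x, xi) = 0.5*(x - xi)^2, xi uniform on {0, 2},
# g(x, xi) = 0.1*|x|.  The minimizer of E[f] + E[g] is 1 - 0.1 = 0.9.
x_hat = stochastic_prox_grad(
    grads=[lambda x, xi: x - xi],
    proxs=[lambda v, gamma, xi: soft(v, 0.1 * gamma)],
    sample_xi=lambda rng: (rng.choice([0.0, 2.0]),),
    n_iters=20000,
    rng=random.Random(0))
```

With M greater than one, the inner loop chains one random proximal gradient step per component, which is exactly the composition structure discussed above.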

Assuming that the set of minimizers of the problem is non-empty, Th. 1 below states that the sequence converges almost surely to a (possibly random) point of this set. The proof of this theorem, which is rather technical, is deferred to Sec. VII. It follows the same lines as the approach of [21], with the difference that we are now dealing with possibly different functions and non-independent noises for .

We now want to exploit this stochastic algorithm to develop a simple procedure leading to a solution of Problem (1). This will be done in Sec. IV and will lead to the Snake algorithm. The first step is to express the function as the expectation of a function with respect to a finite random walk. Given an integer and a finite walk of length on the graph , where and , write

Now, pick a node at random with a probability proportional to the degree (i.e., the number of neighbors) of this node. Once this node has been chosen, pick another one at random uniformly among the neighbors of the first node. Repeat the process of choosing neighbors times, and denote as the random walk thus obtained. With this construction, we get that using some elementary Markov chain formalism (see Prop. 2 below).
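The sampling mechanism just described can be sketched as follows (adjacency lists stored as a dict; the function name is ours):

```python
import random

def sample_walk(adj, length, rng):
    """Draw a random walk on the graph: the first node is chosen with
    probability proportional to its degree, then each subsequent node is
    drawn uniformly among the neighbors of the current one."""
    nodes = list(adj)
    degrees = [len(adj[v]) for v in nodes]
    v = rng.choices(nodes, weights=degrees)[0]   # degree-proportional start
    walk = [v]
    for _ in range(length):
        v = rng.choice(adj[v])                   # uniform neighbor step
        walk.append(v)
    return walk
```

A walk of length k thus contains k + 1 nodes, each consecutive pair being an edge of the graph.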

In these conditions, a first attempt at using Algorithm (4) is to consider Problem (1) as an instance of Problem (3) with , , and . Given an independent sequence of walks having the same law as and a sequence of steps in , Algorithm (4) boils down to the stochastic version of the proximal gradient algorithm


By Th. 1 (or by [21]), the iterates converge almost surely to a solution of Problem (1).

However, although simpler than the deterministic algorithm (2), this algorithm is still difficult to implement for many regularization functions. As said in the introduction, the walk is often required to be a simple path. Obviously, the walk generation mechanism described above does not prevent the walk from having repeated nodes. A first way to circumvent this problem would be to generate as a loop-erased walk on the graph. Unfortunately, the evaluation of the corresponding distribution is notoriously difficult. The generalization of Prop. 2 to loop-erased walks is far from being immediate.

As an alternative, we identify the walk with the concatenation of at most simple paths of maximal length that we denote as , these random variables being valued in the space of all walks in of length at most :

Here, in the most frequent case where the number of simple paths is strictly less than , the last ’s are conventionally set to a trivial walk, i.e., a walk with one node and no edge. We also denote as the length of the simple path , i.e., the number of edges in . We now choose , and for , we set and if , and otherwise. With this construction, we show in Sec. IV that and that the functions and fulfill the general assumptions required for Algorithm (4) to converge to a solution of Problem (1). In summary, at each iteration, we pick a random walk of length according to the procedure described above, we split it into simple paths of maximal length, then we successively apply the proximal gradient algorithm to these simple paths.
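One way to realize this splitting in code, under our reading that consecutive simple paths share their endpoint so that the edges of the walk are partitioned among the paths (names and conventions are ours):

```python
def split_into_simple_paths(walk):
    """Split a walk (list of nodes) into simple paths of maximal length.
    Each new path restarts at the last node of the previous one, so the
    edges of the walk are partitioned among the returned paths."""
    paths = []
    current = [walk[0]]
    seen = {walk[0]}
    for v in walk[1:]:
        if v in seen:                     # v would repeat: close the path
            paths.append(current)
            current = [current[-1], v]    # restart at the shared endpoint
            seen = set(current)
        else:
            current.append(v)
            seen.add(v)
    paths.append(current)
    return paths
```

For example, the walk 3-1-0-3-4-0 splits into the simple paths 3-1-0, 0-3-4 and 4-0, whose edge counts sum to the length of the walk.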

After recalling the contexts of the taut-string and the Laplacian regularization algorithms (Sec. V), we simulate Algorithm (4) in several application contexts. First, we study the so-called graph trend filtering [8], with the parameter defined in [8] set to one. Then, we consider the graph inpainting problem [15, 1, 2]. These contexts are the purpose of Sec. VI. Finally, a conclusion and some future research directions are provided in Sec. VIII.

III A General Stochastic Proximal Gradient Algorithm

Notations. We denote by a probability space and by the corresponding expectation. We let be an arbitrary measurable space. We denote some Euclidean space and by its Borel -field. A mapping is called a normal convex integrand if is -measurable and if is convex for all [28].

III-A Problem and General Algorithm

In this section, we consider the general problem


where is a positive integer, the are random variables (r.v.), and the functions and satisfy the following assumption:

Assumption 1.

The following holds for all :

  1. The and are normal convex integrands.

  2. For every , and .

  3. For every , is differentiable. We denote as its gradient w.r.t. the first variable.

  • In this paper, we assume that the functions have a full domain for almost all . This assumption can be relaxed with some effort, along the ideas developed in [21].

    For every and every , we introduce the mapping defined by

    We define by

    Let be the random vector with values in and let be a sequence of i.i.d. copies of , defined on the same probability space . For all . Finally, let be a positive sequence. Our aim is to analyze the convergence of the iterates recursively defined by:


    as well as the intermediate variables () defined by , and


    In particular, .

    In the special case where the functions , are all constant with respect to (the algorithm is deterministic), the above iterations were studied by Passty in [29]. In the special case where , the algorithm boils down to the stochastic Forward-Backward algorithm, whose detailed convergence analysis can be found in [21] (see also [30], and [31] as an earlier work). In this case, the iterates take the simpler form


    and converge a.s. to a minimizer of under the convenient hypotheses.

It is worth noting that the present algorithm (7) cannot be written as an instance of (9). Indeed, the operator is a composition of (random) operators, whereas the stochastic Forward-Backward algorithm (9) has a simpler structure. This composition raises technical difficulties that need to be specifically addressed.

III-B Almost sure convergence

    We make the following assumptions.

    Assumption 2.

    The positive sequence satisfies the conditions

    (i.e., ). Moreover,

    Assumption 3.

    The following holds for all :

    1. There exists a measurable map s.t. the following holds -a.e.: for all in ,

    2. For all , .

    We denote by the set of minimizers of Problem (6). Thanks to Ass. 1, the qualification conditions hold, ensuring that a point belongs to iff

    The (sub)differential and the expectation operators can be interchanged [32], and the above optimality condition also reads


    where is the Aumann expectation of the random set , defined as the set of expectations of the form , where is a measurable map s.t. is integrable and


    Therefore, the optimality condition (10) means that there exist integrable mappings satisfying (11) and s.t.


    When (11)-(12) hold, we say that the family is a representation of the minimizer . In addition, if for some and every , and , we say that the minimizer admits a -integrable representation.

    Assumption 4.
    1. The set is not empty.

    2. For every , there exists s.t. admits a -integrable representation .

    We denote by the least norm element in .

    Assumption 5.

    For every compact set , there exists such that for all ,

    We can now state the main result of this section, which will be proven in Sec. VII.

    Theorem 1.

Let Ass. 1–5 hold true. There exists a r.v. s.t. and s.t. converges a.s. to as . Moreover, for every , converges a.s. to .

IV The Snake Algorithm

IV-A Notations

    Let be an integer. We refer to a walk of length over the graph as a sequence in such that for every , the pair is an edge of the graph. A walk of length zero is a single vertex.

    We shall often identify with the graph whose vertices and edges are respectively given by the sets and .

    Let We denote by the set of all walks over with length This is a finite set. Let be the set of all subsets of We consider the measurable space

    Let with We abusively denote by the family of functions We refer to the regularization of as the regularization on the graph of the restriction of to that is

    Besides, is defined to be if is a single vertex (that is ).

We say that a walk is a simple path if there is no repeated node, i.e., if all elements in are different, or if is a single vertex. Throughout the paper, we assume that when is a simple path, the computation of can be done easily.

IV-B Writing the Regularization Function as an Expectation

    One key idea of this paper is to write the function as an expectation in order to use a stochastic approximation algorithm, as described in Sec. III.

    Denote by the degree of the node , i.e., the number of neighbors of in . Let be the probability measure on defined as

    Define the probability transition kernel on as if , and otherwise, where is the indicator function.

We refer to a Markov chain (indexed by ) over with initial distribution and transition kernel as an infinite random walk over . Let be an infinite random walk over defined on the canonical probability space with . The first node of this walk is randomly chosen in according to the distribution . The other nodes are drawn recursively according to the conditional probability . In other words, conditionally on , the node is drawn uniformly from the neighborhood of . Setting an integer , we define the random variable from as

    Proposition 2.

    For every ,


    It is straightforward to show that is an invariant measure of the Markov chain . Moreover, , leading to the identity

    which completes the proof by symmetry of . ∎
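The invariance of the degree-proportional distribution under the uniform-neighbor kernel, which underlies Prop. 2, can be checked numerically on any small graph (a sketch with our own naming):

```python
def stationary_check(adj):
    """Check that pi(v) = deg(v) / (2|E|) satisfies pi P = pi, where
    P(u, v) = 1/deg(u) whenever (u, v) is an edge.  Returns both pi
    and the vector pi P for comparison."""
    two_e = sum(len(nbrs) for nbrs in adj.values())   # equals 2|E|
    pi = {v: len(adj[v]) / two_e for v in adj}
    pi_p = {v: 0.0 for v in adj}
    for u, nbrs in adj.items():
        for v in nbrs:
            pi_p[v] += pi[u] / len(nbrs)
    return pi, pi_p

# A triangle with a pendant node attached to vertex 1.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
pi, pi_p = stationary_check(adj)
```

Indeed, summing pi(u) P(u, v) over the neighbors u of v gives deg(v) / (2|E|) back, whatever the graph.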

    This proposition shows that Problem (1) is written equivalently


    Hence, applying the stochastic proximal gradient algorithm to solve (14) leads to a new algorithm to solve (1), which was mentioned in Sec. II, Eq. (5):


    Although the iteration complexity is reduced in (15) compared to (2), the computation of the proximity operator of the -regularization over the random subgraph in the algorithm (15) can be difficult to implement. This is due to the possible presence of loops in the random walk . As an alternative, we split into several simple paths. We will then replace the proximity operator over by the series of the proximity operators over the simple paths induced by , which are efficiently computable.

IV-C Splitting into Simple Paths

Let be an infinite random walk on . We recursively define a sequence of stopping times as and for all ,

    if the above set is nonempty, and otherwise. We now define the stopping times for all as Finally, for all we can consider the random variable on with values in defined by

    We denote by the smallest integer such that . We denote by the length of the simple path

• Given a graph with vertices and a given edge set that is not useful to describe here, consider and the walk with length . Then, , , , and can be decomposed into simple paths and we have , , and . Their respective lengths are , , and for all . We identify with

      It is worth noting that, by construction, is a simple path. Moreover, the following statements hold:

      • We have a.s.

      • These three events are equivalent for all : { is a single vertex}, {} and {}

      • The last element of is a.s.

      • a.s.

In the sequel, we identify the random vector with the random variable . As a result, is seen as a r.v. with values in

      Our notations are summarized in Table I.

      Graph with no self-loop
      walk on
      infinite random walk
      random walk of length
      random simple path
      length of
      regularization of on
      regularization of along the walk
      TABLE I: Useful Notations

      For every , define the functions on in such a way that


      Note that when then .

      Proposition 3.

      For every , we have


      For every and every ,

      Integrating, and using Prop. 2, it follows that . Moreover, . This completes the proof. ∎

IV-D Main Algorithm

Prop. 3 suggests that minimizers of Problem (1) can be found by minimizing the right-hand side of (18). This can be achieved by means of the stochastic approximation algorithm provided in Sec. III. The corresponding iterations (7) read as where are i.i.d. copies of . For every , the intermediate variable given by Eq. (8) satisfies

      Theorem 4.

      Let Ass. 2 hold true. Assume that the convex function is differentiable and that is Lipschitz continuous. Assume that Problem (1) admits a minimizer. Then, there exists a r.v.  s.t. is a minimizer of (1) for all -a.e., and s.t. the sequence defined above converges a.s. to as . Moreover, for every , converges a.s. to .


It is sufficient to verify that the mappings , defined by (16) and (17) respectively, fulfill Ass. 1–5 of Th. 1. Then, Th. 1 gives the conclusion. Ass. 1 and 3 are trivially satisfied. It remains to show, for every minimizer , the existence of a -representation, for some . Any such satisfies Eq. (12) where satisfies (11). By definition of and , it is straightforward to show that there exists a deterministic constant depending only on and the graph , such that and . This proves Ass. 4. Ass. 5 can be easily checked by the same arguments. ∎

      procedure Snake()
           while stopping criterion is not met do
               if  then
                     is at this step
               end if
           end while
      end procedure
      TABLE II: Proposed Snake algorithm.
      procedure Simple_path()
           while  and Length(c) do
           end while
      end procedure
      TABLE III: Simple_path procedure.

Consider the general -regularized problem (1), and assume that an efficient procedure to compute the proximity operator of the -regularization over a 1D-graph is available. The sequence is generated by the Snake algorithm (applied with this efficient 1D procedure) and is summarized in Table II. Recall the definition of the probability on and the transition kernel on . The procedure presented in this table calls the following subroutines.

      • If is a finite walk, is the last element of and Length is its length as a walk that is

      • The procedure Rnd_Oriented_Edge returns a tuple of two nodes randomly chosen where and

      • For every , every simple path and every , Prox1D is any procedure that returns the quantity

      • The procedure Uniform_Neib returns a random vertex drawn uniformly amongst the neighbors of the vertex that is with distribution .

• The procedure Simple_path, described in Table III, generates the first steps of a random walk on with transition kernel initialized at the vertex , and prefaced by the first node in . It represents the ’s of the previous section. The random walk is stopped when one node is repeated, or when the maximum number of samples is reached. The procedure produces two outputs, the walk and the oriented edge . In the case where the procedure stopped due to a repeated node, represents the simple path obtained by stopping the walk before the first repetition occurs, while is the vertex which has been repeated (referred to as the pivot node). In the case where no vertex is repeated, it means that the procedure stopped because the maximum length was reached. In that case, represents the last simple path generated, and the algorithm does not use the pivot node .

• Although Snake converges for every value of the hyperparameter , a natural question concerns the influence of on the behavior of the algorithm. In the case where is the TV regularization, [16] notes that, empirically, the taut-string algorithm used to compute the proximity operator has a complexity of order . The same holds for the Laplacian regularization. Hence, the parameter controls the complexity of each iteration. On the other hand, in the reformulation of Problem (1) into the stochastic form (13), the random variable is an unbiased estimate of . By the ergodic theorem, the larger , the more accurate the approximation. Hence, there is a trade-off between the complexity of an iteration and the precision of the algorithm. This trade-off is standard in the machine learning literature. It often appears when sampling mini-batches in order to apply the stochastic gradient algorithm to a deterministic optimization problem (see [19, 20]). The choice of is somewhat similar to the choice of the length of the mini-batches in this context.

        Providing a theoretical rule that would optimally select the value of is a difficult task that is beyond the scope of this paper. Nevertheless, in Sec. VI, we provide a detailed analysis of the influence of on the numerical performance of the algorithm.
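Our reading of the Simple_path subroutine can be sketched as follows (the function name and interface are illustrative; the pivot convention follows the description above):

```python
import random

def simple_path_proc(adj, start, kmax, rng):
    """Grow a walk from `start`, drawing uniform neighbor steps, until a
    node repeats or kmax edges have been drawn.  Returns the simple path
    traversed and the pivot node (None when no repetition occurred)."""
    path, seen, pivot = [start], {start}, None
    while len(path) - 1 < kmax:
        nxt = rng.choice(adj[path[-1]])
        if nxt in seen:
            pivot = nxt       # first repeated node: stop the path here
            break
        path.append(nxt)
        seen.add(nxt)
    return path, pivot

# Triangle with a pendant node; start the walk at vertex 0.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
path, pivot = simple_path_proc(adj, 0, 4, random.Random(7))
```

Whatever the random draws, the returned path is simple, has at most kmax edges, and the pivot (when present) is one of its nodes.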

      V Proximity operator over 1D-graphs

We now provide some special cases of -regularizations for which the computation of the proximity operator over 1D-graphs is easily tractable. Specifically, we address the case of the total variation regularization and the Laplacian regularization, which are particular cases of -regularizations.

V-A Total Variation norm

      In the case where reduces to the weighted TV regularization

and in the case where reduces to its unweighted version

As mentioned above, there exists a fast method, the taut-string algorithm, to compute the proximity operator of these regularizations over a 1D-graph ([13, 16]).

      V-B Laplacian regularization

In the case where reduces to the Laplacian regularization, that is

      Its unweighted version is

      In the case where

      is the normalized Laplacian regularization.

We now explain one method to compute the proximity operator of the unweighted Laplacian regularization over a 1D-graph. The computation of the proximity operator of the normalized Laplacian regularization can be done similarly. The computation of the proximity operator of the weighted Laplacian regularization over a 1D-graph is as fast as that of the unweighted Laplacian regularization, using, for example, Thomas’ algorithm.

      The proximity operator of a fixed point is obtained as a solution to a quadratic programming problem of the form:

      where is a scaling parameter. Writing the first order conditions, the solution satisfies


where is the Laplacian matrix of the 1D-graph with nodes and is the identity matrix in . By [17], can be diagonalized explicitly. In particular, has eigenvalues

      and eigenvectors

      for . Hence, , where gathers the eigenvalues of and the operators and are the discrete cosine transform operator and the inverse discrete cosine transform. The practical computation of can be found in operations.
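Alternatively, since the system matrix is tridiagonal, the linear system can be solved directly by Thomas’ algorithm in linear time. The sketch below assumes the unweighted prox problem min 0.5||x - y||² + γ Σ (x_{i+1} - x_i)², whose optimality condition reads (I + 2γL)x = y (this scaling convention and the function name are our assumptions):

```python
def prox_laplacian_1d(y, gamma):
    """Solve (I + 2*gamma*L) x = y by the Thomas algorithm, where L is
    the Laplacian of the 1D path graph on len(y) nodes (len(y) >= 2).
    This is the prox of gamma * sum((x[i+1]-x[i])**2) evaluated at y."""
    n = len(y)
    g = 2.0 * gamma
    # Tridiagonal system: sub- and super-diagonals are -g; the main
    # diagonal is 1 + g at the endpoints and 1 + 2g in the interior.
    diag = [1.0 + (g if i in (0, n - 1) else 2.0 * g) for i in range(n)]
    c = [0.0] * n   # modified superdiagonal
    d = [0.0] * n   # modified right-hand side
    # Forward elimination (stable: the matrix is diagonally dominant).
    c[0] = -g / diag[0]
    d[0] = y[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] + g * c[i - 1]
        c[i] = -g / denom
        d[i] = (y[i] + g * d[i - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

y = [1.0, -2.0, 0.5, 3.0, 0.0]
x = prox_laplacian_1d(y, 0.3)
```

Since the matrix is strictly diagonally dominant, no pivoting is needed, and the solve costs a linear number of operations, matching the fast 1D setting discussed above.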

VI Examples

      We now give some practical instances of Problem (1) by particularizing and the -regularization in (1). We also provide some simulations to compare our method to existing algorithms.

VI-A Trend Filtering on Graphs

      Consider a vector . The Graph Trend Filtering (GTF) estimate on with parameter set to one is defined in [8] by