Stochastic Primal-Dual Hybrid Gradient Algorithm with Arbitrary Sampling and Imaging ApplicationsSubmitted to the editors 15/06/2017. \fundingA. C. benefited from a support of the ANR, \enquoteEANOI Project I1148 / ANR-12-IS01-0003 (joint with FWF). Part of this work was done while he was hosted in Churchill College and DAMTP, Centre for Mathematical Sciences, University of Cambridge, thanks to a support of the French Embassy in the UK and the Cantab Capital Institute for Mathematics of Information. M. J. E. and C.-B. S. acknowledge support from Leverhulme Trust project \enquoteBreaking the non-convexity barrier, EPSRC grant \enquoteEP/M00483X/1, EPSRC centre \enquoteEP/N014588/1, the Cantab Capital Institute for the Mathematics of Information, and from CHiPS (Horizon 2020 RISE project grant). M. J. E. has carried out initial work supported by the EPSRC platform grant \enquoteEP/M020533/1. Moreover, C.-B. S. is thankful for support by the Alan Turing Institute. P. R. acknowledges the support of EPSRC Fellowship in Mathematical Sciences \enquoteEP/N005538/1 entitled \enquoteRandomized algorithms for extreme convex optimization.

# Stochastic Primal-Dual Hybrid Gradient Algorithm with Arbitrary Sampling and Imaging Applications††thanks: Submitted to the editors 15/06/2017. \fundingA. C. benefited from a support of the ANR, \enquoteEANOI Project I1148 / ANR-12-IS01-0003 (joint with FWF). Part of this work was done while he was hosted in Churchill College and DAMTP, Centre for Mathematical Sciences, University of Cambridge, thanks to a support of the French Embassy in the UK and the Cantab Capital Institute for Mathematics of Information. M. J. E. and C.-B. S. acknowledge support from Leverhulme Trust project \enquoteBreaking the non-convexity barrier, EPSRC grant \enquoteEP/M00483X/1, EPSRC centre \enquoteEP/N014588/1, the Cantab Capital Institute for the Mathematics of Information, and from CHiPS (Horizon 2020 RISE project grant). M. J. E. has carried out initial work supported by the EPSRC platform grant \enquoteEP/M020533/1. Moreover, C.-B. S. is thankful for support by the Alan Turing Institute. P. R. acknowledges the support of EPSRC Fellowship in Mathematical Sciences \enquoteEP/N005538/1 entitled \enquoteRandomized algorithms for extreme convex optimization.

Antonin Chambolle CMAP, Ecole Polytechnique, CNRS, France ().    Matthias J. Ehrhardt Department for Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge CB3 0WA, UK (, ).    Peter Richtárik (i) Visual Computing Center & Extreme Computing Research Center, KAUST, Thuwal, Saudi Arabia; (ii) School of Mathematics, University of Edinburgh, Edinburgh, United Kingdom; (iii) The Alan Turing Institute, London, United Kingdom ().    Carola-Bibiane Schönlieb33footnotemark: 3
###### Abstract

We propose a stochastic extension of the primal-dual hybrid gradient algorithm studied by Chambolle and Pock in 2011 to solve saddle point problems that are separable in the dual variable. The analysis is carried out for general convex-concave saddle point problems and problems that are either partially smooth / strongly convex or fully smooth / strongly convex. We perform the analysis for arbitrary samplings of dual variables, and obtain known deterministic results as a special case. Several variants of our stochastic method significantly outperform the deterministic variant on a variety of imaging tasks.

c

Stochastic Primal-Dual Hybrid Gradient AlgorithmA. Chambolle, M. J. Ehrhardt, P. Richtárik and C.-B. Schönlieb \pdfstringdefDisableCommands\pdfstringdefDisableCommands

onvex optimization, primal-dual algorithms, saddle point problems, stochastic optimization, imaging

{AMS}

65D18, 65K10, 74S60, 90C25, 90C15, 92C55, 94A08

## 1 Introduction

Many modern problems in a variety of disciplines (imaging, machine learning, statistics, etc.) can be formulated as convex optimization problems. Instead of solving the optimization problems directly, it is often advantageous to reformulate the problem as a saddle point problem. A very popular algorithm to solve such saddle point problems is the primal-dual hybrid gradient (PDHG)111We follow the terminology of [13] and call the algorithm simply PDHG. It corresponds to PDHGMu and PDHGMp in [20]. algorithm [36, 20, 12, 35, 13, 14]. It has been used to solve a vast amount of state-of-the-art problems—to name a few examples in imaging: image denoising with the structure tensor [21], total generalized variation denoising [10], dynamic regularization [6], multi-modal medical imaging [26], multi-spectral medical imaging [42], computation of non-linear eigenfunctions [25], regularization with directional total generalized variation [28]. Its popularity stems from two facts: First, it is very simple and therefore easy to implement. Second, it involves only simple operations like matrix-vector multiplications and evaluations of proximal operators which are for many problems of interest simple and in closed-form or easy to compute iteratively, cf. e.g. [32]. However, for large problems that are encountered in many real world applications, even these simple operations might be still too costly to perform too often.

We propose a stochastic extension of the PDHG for saddle point problems that are separable in the dual variable (cf. e.g. [17, 51, 53, 33]) where not all but only a few of these operations are performed in every iteration. Moreover, as in incremental optimization algorithms [46, 30, 9, 8, 7, 44, 18] over the course of the iterations we continuously build up information from previous iterations which reduces variance and thereby negative effects of stochasticity. Non-uniform samplings [39, 37, 51, 38, 2] have been proven very efficient for stochastic optimization. In this work we use the expected separable overapproximation framework of [37, 38, 40] to prove all statements for all non-trivial and iteration-independent samplings.

#### Related Work

The proposed algorithm can be seen as a generalization of the algorithm of [17, 53, 51] to arbitrary blocks and a much wider class of samplings. Moreover, in contrast to their results, our results generalize the deterministic case considered in [36, 12, 35, 14]. Fercoq and Bianchi [22] proposed a stochastic primal-dual algorithm with explicit gradient steps that allows for larger step sizes by averaging over previous iterates, however, this comes at the cost of prohibitively large memory requirements. Similar memory issues are encountered by a primal-dual algorithm of [3]. It is related to forward-backward splitting [29] and averaged gradient descent [9, 19] and therefore suffers the same memory issues as the averaged gradient descent. Moreover, Valkonen proposed a stochastic primal-dual algorithm that can exploit partial strong convexity of the saddle point functional [47]. Randomized versions of the alternating direction method of multipliers are discussed for instance in [52, 24]. In contrast to other works on stochastic primal-dual algorithms [34, 50], our analysis is not based on Fejér monotonicity [15]. We therefore do not prove almost sure convergence of the sequence but prove a variety of convergence rates depending on strong convexity assumptions instead.

As a word of warning, our contribution should not be mistaken by other \enquotestochastic primal-dual algorithms, where errors in the computation of matrix-vector products and evaluation of proximal operators are modeled by random variables, cf. e.g. [34, 15, 43]. In our work we deliberately choose to compute only a subset of a whole iteration to save computational cost. These two notations are related but are certainly not the same.

### 1.1 Contributions

We briefly mention the main contributions of our work.

#### Generalization of Deterministic Case

The proposed stochastic algorithm is a direct generalization of the deterministic setting [36, 12, 35, 13, 14]. In the degenerate case where in every iteration all computations are performed, our algorithm coincides with the original deterministic algorithm. Moreover, the same holds true for our analysis of the stochastic algorithm where we recover almost all deterministic statements [12, 35] in this degenerate case. Therefore, the theorems for both the deterministic and the stochastic case can be combined by a single proof.

#### Better Rates

Our analysis extends the simple setting of [51] such that the strong convexity assumptions and the sampling do not have to be uniform. Even in the special case of uniform strong convexity and uniform sampling, the proven convergence rates are slightly better than the ones proven in [51].

#### Arbitrary Sampling

The proposed algorithm is guaranteed to converge under a very general class of samplings [37, 38, 40] and thereby generalizes also the algorithm of [51] which has only been analyzed for two specific samplings. As long as the sampling is independent and identically distributed over the iterations and all computations have non-zero probability to be carried out, the theory holds and the algorithm will converge with the proven convergence rates.

#### Acceleration

We propose an acceleration of the stochastic primal-dual algorithm which accelerates the convergence from to if parts of the saddle point functional are strongly convex and thereby results in a significantly faster algorithm.

#### Scaling Invariance

In the strongly convex case, we propose parameters for several serial samplings (uniform, importance, optimal), all based on the condition numbers of the problem and thereby independent of scaling.

## 2 General Problem

Let be real Hilbert spaces of any dimension and define the product space . For , we shall write , where . Further, we consider the natural inner product on the product space given by , where . This inner product induces the norm . Moreover, for simplicity we will consider the space that combines both primal and dual variables.

Let be a bounded linear operator. Due to the product space nature of , we have , where are linear operators. The adjoint of is given by . Moreover, let and be convex functions. In particular, we assume that is separable, i.e. .

Given the setup described above, we consider the optimization problem

 minx∈X{Φ(x):=n∑i=1fi(Aix)+g(x)}. (1)

Instead of solving Eq. 1 directly, it is often desirable to reformulate the problem as a saddle point problem with the help of the Fenchel conjugate. If is proper, convex, and lower semi-continuous, then where , is the Fenchel conjugate of (and its biconjugate defined as the conjugate of the conjugate). Then solving Eq. 1 is equivalent to finding the primal part of a solution to the saddle point problem (called a saddle point)

 minpx∈Xsupy∈Y{Ψ(x,y)\coloneqqn∑i=1⟨Aix,yi⟩−f∗i(yi)+g(x)}. (2)

We will assume that the saddle point problem Eq. 2 has a solution. For conditions for existence and uniqueness, we refer the reader to [4]. A saddle point satisfies the optimality conditions

 Aix♯∈∂f∗i(y♯i)i=1,…,n,−A∗y♯∈∂g(x♯).

An important notion in this work is strong convexity. A functional is called -convex if is convex. In general, we assume that is -convex, is -convex with nonnegative strong convexity parameters . The convergence results in this contribution cover three different cases of regularity: i) no strong convexity , ii) semi strong convexity or and iii) full strong convexity . For notational convenience we make use of the operator .

A very popular algorithm to solve the saddle point problem Eq. 2 is the Primal-Dual Hybrid Gradient (PDHG) algorithm [36, 20, 12, 35, 13, 14]. It reads (with extrapolation on )

 x(k+1) =proxτg(x(k)−τA∗¯¯¯y(k)) y(k+1) =proxσf∗(y(k)+σAx(k+1)) ¯¯¯y(k+1) =y(k+1)+θ(y(k+1)−y(k)),

where the proximal operator (or proximity / resolvent operator) is defined as

 proxτf(y)\coloneqqargminx∈X{12∥x−y∥2τ−1+f(x)}

and the weighted norm by . Its convergence is guaranteed if the step size parameters are positive and satisfy [12]. Note that the definition of the proximal operator is well-defined for an operator-valued step size . In the case of a separable function and with operator-valued step sizes the PDHG algorithm takes the form \cref@addtoresetequationparentequation

 x(k+1) =proxTg(x(k)−TA∗¯¯¯y(k)) (3a) y(k+1)i =proxSif∗i(y(k)i+SiAix(k+1))i=1,…,n (3b) ¯¯¯y(k+1) =y(k+1)+θ(y(k+1)−y(k)). (3c)

Here the step size parameters (a block diagonal operator), and are symmetric and positive definite. The algorithm is guaranteed to converge if and [35].

## 3 Algorithm

In this work we extend the PDHG algorithm to a stochastic setting where in each iteration we update a random subset of the dual variables Eq. 3b. This subset is sampled in an i.i.d. fashion from a fixed but otherwise arbitrary distribution, whence the name \enquotearbitrary sampling. In order to guarantee convergence, it is necessary to assume that the sampling is \enquoteproper [41, 38]. A sampling is proper if for each dual variable we have with a positive probability . Examples of proper samplings include the full sampling where with probability 1 and serial sampling where is chosen with probability . It is important to note that also other samplings are admissible. For instance for , consider the sampling that selects with probability and with probability . Then the probabilities for the three blocks are , and which makes it a proper sampling. However, if only is chosen with probability 1, then this sampling is not proper as the probability for the third block is zero: .

The algorithm we propose is formalized as Algorithm 1. As in the original PDHG, the step size parameters have to be self-adjoint and positive definite operators for the updates to be well-defined. The extrapolation is performed with a scalar and an operator of probabilities that an index is selected in each iteration.

###### Remark

Both, the primal and dual iterates and are random variables but only the dual iterate depends on the sampling . However, depends of course on all previous samplings .

###### Remark

Due to the sampling each iteration requires both and to be evaluated only for each selected index . To see this, note that

 A∗¯¯¯y(k+1) =A∗y(k)+∑i∈S(k+1)(1+θpi)A∗i(y(k+1)i−y(k)i)

where can be stored from the previous iteration (needs the same memory as the primal variable ) and the operators are evaluated only for .

## 4 General Convex Case

We first analyze the convergence of Algorithm 1 in the general convex case without making use of any strong convexity or smoothness assumptions. In order to analyze the convergence for the large class of samplings described in the previous section we make use of the expected separable overapproximation (ESO) inequality [38].

###### Definition 1 (Expected Separable Overapproximation (ESO) [38])

Let be a random set and the probability that is in . We say that fulfill the ESO inequality (depending on and ) if for all it holds that

 ES∥∥ ∥∥∑i∈SC∗izi∥∥ ∥∥2≤n∑i=1pivi∥zi∥2. (4)

Such parameters are called ESO parameters.

###### Remark

Note that for any bounded linear operator such parameters always exist but are obviously not unique. For the efficiency of the algorithm it is desirable to find ESO parameters such that Eq. 4 is as tight as possible; i.e., we want the parameters be small. As we shall see, the ESO parameters influence the choice of the extrapolation parameter in the strongly convex case.

The ESO inequality was first proposed by Richtárik and Takáč [41] to study parallel coordinate descent methods in the context of uniform samplings, which are samplings for which for all . Improved bounds for ESO parameters were obtained in [23] and used in the context of accelerated coordinate descent. Qu et al. [38] perform an in-depth study of ESO parameters. The ESO inequality is also critical in the study mini-batch stochastic gradient descent with [27] or without [45] variance reduction.

We will frequently need to estimate the expected value of inner products which we will do by means of ESO parameters. Recall that we defined weighted norms as . The proof of this lemma can be found in the appendix.

###### Lemma 1

Let be a random set and if and otherwise. Moreover, let be some ESO parameters of and . Then for any and

 2ES⟨QAx,y+−y⟩≥−ES{1c∥x∥2T−1+cmaxivipi∥y+−y∥2QS−1}.

###### Example (Full Sampling)

Let with probability 1 such that . Then are ESO parameters of . Thus, the deterministic condition on convergence, , implies a bound on some ESO parameters .

###### Example (Serial Sampling)

Let be chosen with probability . Then are ESO parameters of .

The analysis for the general convex case will use the notation of Bregman distance which is defined for any function , and in the subdifferential of at as

 Dqf(x,y)\coloneqqf(x)−f(y)−⟨q,x−y⟩.

Next to Bregman distances, one can measure optimality by the partial primal-dual gap. Let , then we define the partial primal-dual gap as

 GB1×B2(x,y)\coloneqqsup~y∈B2Ψ(x,~y)−inf~x∈B1Ψ(~x,y).

It is convenient to define and to denote the gap as Note that if contains a saddle point , then we have that

 GB(w)≥Ψ(x,y♯)−Ψ(x♯,y)=D−A∗y♯g(x,x♯)+DAx♯f∗(y,y♯)=Dqh(w,w♯)≥0

where the first equality is obtained by adding a zero and we used and for the last equality. The non-negativity stems from the fact that Bregman distances of convex functionals are non-negative and is convex indeed.

We will make frequent use of the following \enquotedistance functions

 Fi(yi|~x,~yi)\coloneqqf∗i(yi)−f∗i(~yi)−⟨Ai~x,yi−~yi⟩

and Note that these are strongly related to Bregman distances; if is a saddle point, then is the Bregman distance of between and . Similarly, we make use of the weighted distance

 Fp(y|~w)\coloneqqn∑i=1(1pi−1)Fi(yi|~x,~yi)

and the distance for the primal functional We note that these distances are also related to the partial primal-dual gap as

 GB(w)=sup~w∈BG(x|~w)+F(y|~w)=sup~w∈BH(w|~w).
###### Theorem 4.1

Let be convex, and be chosen so that there exist ESO parameters of with

 vi

Then, the Bregman distance between iterates of Algorithm 1 and any saddle point converges to zero almost surely,

 Dqh(w(k),w♯)→0a.s. (6)

Moreover, the ergodic sequence converges with rate in an expected partial primal-dual gap sense, i.e. for any set it holds

 EGB(w(K))≤CBK (7)

where the constant is given by

 CB\coloneqqsupx∈B112∥x(0)−x∥2T−1+supy∈B212∥y(0)−y∥2QS−1+supw∈BFp(y(0)|w). (8)

The same rate holds for the expected Bregman distance, .

###### Remark

The convergence Eq. 6 in Bregman distance implies convergence in norm as soon as is strictly convex. If is not strictly convex, then the convergence has to be seen in a more generalized sense. For example, if is a -norm (and thus not strictly convex), then the Bregman distance between and is zero if and only if they have the same support and sign. Thus, the convergence statement is related to the support and sign of . In the extreme case , then and the convergence statement has no meaning.

The proof of this theorem utilizes a standard inequality for which we provide the proof in the appendix for completeness.

###### Lemma 2

 x(k+1) =proxT(k)g(x(k)−T(k)A∗¯¯¯y(k)) ^y(k+1)i =prox[S(k)]if∗i(y(k)i+[S(k)]iAix(k+1))i=1,…,n

with iteration varying step sizes and . Then for any it holds that

 ∥x(k)−x∥2T−1(k)+∥y(k)−y∥2S−1(k) ≥∥x(k+1)−x∥2T−1(k)+μgI+∥^y(k+1)−y∥2S−1(k)+M +2(G(x(k+1)|w)+F(^y(k+1)|w))−2⟨A(x(k+1)−x),^y(k+1)−¯¯¯y(k)⟩ +∥x(k+1)−x(k)∥2T−1(k)+∥^y(k+1)−y(k)∥2S−1(k).

###### Proof (Proof of Theorem 4.1)

The result of Lemma 2 (with constant step sizes) has to be adapted to the stochastic setting as the dual iterate is updated only with a certain probability. First, a trivial observation is that for any mapping it holds that

 φ(^y(k+1)i) =1piE(k+1)φ(y(k+1)i)−(1pi−1)φ(y(k)i) =(1pi−1)E(k+1)φ(y(k+1)i)−(1pi−1)φ(y(k)i)+E(k+1)φ(y(k+1)i). (9)

Thus, for the generalized distance of we arrive at

 F(^y(k+1)|w) =E(k+1)Fp(y(k+1)|w)−Fp(y(k)|w)+E(k+1)F(y(k+1)|w). (10)

and for any block diagonal matrix

 ∥^y(k+1)−⋅∥2B =E(k+1)∥y(k+1)−⋅∥2QB−∥y(k)−⋅∥2(Q−I)B, (11) ^y(k+1) =QE(k+1)y(k+1)−(Q−I)y(k). (12)

Using Eqs. 12, 11 and 10, we can rewrite the estimate of Lemma 2 as

 ∥x(k)−x∥2T−1+∥y(k)−y∥2QS−1+2Fp(y(k)|w) ≥E(k+1){∥x(k+1)−x∥2T−1+∥y(k+1)−y∥2QS−1+2Fp(y(k+1)|w) +2H(w(k+1)|w)−2⟨A(x(k+1)−x),Q(y(k+1)−y(k))+y(k)−¯¯¯y(k)⟩ +∥x(k+1)−x(k)∥2T−1+∥y(k+1)−y(k)∥2QS−1}. (13)

where we have used the identity

 ∥⋅∥2B+∥⋅∥2D=∥⋅∥2B+D (14)

to simplify the expression. With the extrapolation , the inner product term can be reformulated as

 −⟨A(x(k+1)−x),Q(y(k+1)−y(k))+y(k)−¯¯¯y(k)⟩ =−⟨QA(x(k+1)−x),y(k+1)−y(k)⟩+⟨QA(x(k+1)−x),y(k)−y(k−1)⟩ =−⟨QA(x(k+1)−x),y(k+1)−y(k)⟩+⟨QA(x(k)−x),y(k)−y(k−1)⟩ +⟨QA(x(k+1)−x(k)),y(k)−y(k−1)⟩ (15)

and with Lemma 1 and it holds that

 2E(k)⟨QA(x(k+1)−x(k)),y(k)−y(k−1)⟩ ≥−E(k){γ2∥x(k+1)−x(k)∥2T−1+∥y(k)−y(k−1)∥2QS−1}. (16)

Taking expectations with respect to (denoting this by ) on Eq. 13, using the estimates Eqs. 16 and 15 and denoting

 Δ(k)\coloneqqE{ ∥x(k)−x∥2T−1+∥y(k)−y∥2QS−1+2Fp(y(k)|w) +∥y(k)−y(k−1)∥2QS−1−2⟨QA(x(k)−x),y(k)−y(k−1)⟩}

leads with (follows directly from Eq. 5) to

 Δ(k) ≥Δ(k+1)+E(2H(w(k+1)|w)+(1−γ2)∥x(k+1)−x(k)∥2T−1) ≥Δ(k+1)+2EH(w(k+1)|w). (17)

Summing Eq. 17 over (note that ) and using the estimate (follows directly from Lemma 1)

 Δ(K)≥(1−γ2)∥x(K)−x∥2T−1+∥y(K)−y∥2QS−1+2Fp(y(K)|w)≥2Fp(y(K)|w)

yields

 E{Fp(y(K)|w)+K∑k=1H(w(k)|w)}≤Δ(0)2. (18)

All assertions of the theorem follow from inequality Eq. 18. Inserting a saddle point and taking the limit , it follows from Eq. 18 that which implies almost surely and thus Eq. 6.

To see Eq. 7, note first that

 Fp(y(0)|w)−Fp(y(K)|w)=Fp(y(0)|x,y(K))≤supw∈BFp(y(0)|w)

and if with as defined in Eq. 8. Moreover, the generalized distance is convex, thus, dividing Eq. 18 by yields

 EH(w(K)|w)≤1KEK∑k=1H(w(k)|w)≤CBK

for any . Taking the supremum over yields Eq. 7. Noting that completes the proof.

## 5 Semi-Strongly Convex Case

In this section we propose two algorithms that converge as if either or is strongly convex. For simplicity we restrict ourselves from now on to scalar-valued step sizes, i.e. and . However, large parts of what follows holds true for operator-valued step sizes, too.