
# Projection Algorithms for Finite Sum Constrained Optimization

Hong-Kun Xu, School of Science, Hangzhou Dianzi University, Hangzhou 310018, China, and Vera Roshchina, School of Mathematics and Statistics, University of New South Wales, Sydney, NSW 2052, Australia
###### Abstract.

Parallel and cyclic projection algorithms are proposed for minimizing the sum of a finite family of convex functions over the intersection of a finite family of closed convex subsets of a Hilbert space. These algorithms are of predictor-corrector type, with each main iteration consisting of an inner cycle of subgradient descent steps followed by a projection step. Working in the finite-dimensional setting, we prove the convergence of these methods to an optimal solution of the composite minimization problem under investigation, assuming that the gradients of the component functions are bounded at the iterates and that the stepsizes are chosen appropriately. We also discuss generalizations and limitations of the proposed algorithms and our techniques.

###### Key words and phrases:
convex feasibility, composite minimization, projection algorithm.
###### 2010 Mathematics Subject Classification:
90C25, 90C52, 65K10, 47J25.
Corresponding author.

## 1. Introduction

We are concerned with a composite minimization problem, that is, we consider the case where the objective function is decomposed into the sum of a finite family of convex functions and the set of constraints is the intersection of finitely many closed convex subsets of a real Hilbert space H. Precisely, the minimization problem under investigation in this paper is of the form

 (1.1)  min_{x ∈ C := ⋂_{i=1}^M C_i} f(x) := ∑_{j=1}^N f_j(x),

where M and N are positive integers, each set C_i is a nonempty closed convex subset of H, and each component function f_j is a convex function. We always assume the feasible set C is nonempty.

Large-scale optimization problems of form (1.1) naturally arise in modern applications, in particular, network design [16, 12] and machine learning [14, 26, 15]. When the constraint of (1.1) is defined explicitly by a system of inequalities, penalty and augmented Lagrangian techniques, as well as proximal and bundle methods, can be applied to this problem. However, when projections onto the constraint sets are readily available, the treatment of constraints via projection techniques may be preferable as computationally robust and memory efficient. One approach that allows us to apply projection methods to (1.1) is to replace the optimization problem (1.1) with a sequence of convex feasibility problems (CFPs), as is done in [13]. Our development is more direct: we build on the ideas of [11] to prove the convergence of subgradient projection techniques that utilize projections onto individual constraint sets. We note that despite a large body of work dedicated to solving convex feasibility problems via projection methods (see [23, 5, 17, 4, 8] for recent advancements and [3, 7] for textbook exposition) and the vast literature on optimization methods that utilize a single projection onto the constraint set (for recent works see, e.g., [18, 25, 19]), little has been done on combining optimization and projection steps on several sets, beyond the aforementioned paper by De Pierro and Helou Neto [11]. Our aim is to make a substantial contribution towards bridging this gap. Recent progress on forcing the convergence of Douglas–Rachford type methods to the smallest norm feasible point [1] also indicates that it may be possible to extend our approach to a larger class of projection techniques.

The convex feasibility problem (CFP) [2, 9] is formulated as

 (1.2)  finding a point x^* with the property: x^* ∈ ⋂_{i=1}^M C_i.

Thus, the composite minimization problem (1.1) can alternatively be rephrased as finding a solution to the convex feasibility problem (1.2) which also minimizes the composite function f as defined in (1.1). Consequently, two points should be taken into consideration in algorithmic approaches to (1.1):

(a) the descent property of the values of the objective function f, and

(b) the (approximate) feasibility of the iterates generated by the algorithm.

To illustrate these points we consider the special case where M = N = 1 and the function f_1 is smooth. In this case, (1.1) reduces to the constrained convex minimization problem:

 (1.3)  min_{x ∈ C_1} f_1(x).

The gradient-projection algorithm (GPA) can solve (1.3): GPA generates a sequence (x_k) by the recursion

 (1.4)  x_{k+1} = P_{C_1}(x_k − λ_k ∇f_1(x_k)),

where the initial guess x_0 ∈ H is chosen arbitrarily, and λ_k > 0 is the stepsize. Assume:

1. The gradient ∇f_1 of f_1 is α-Lipschitz for some α > 0:

 ‖∇f_1(x) − ∇f_1(z)‖ ≤ α‖x − z‖,  x, z ∈ H;

2. The sequence of stepsizes (λ_k) satisfies the condition:

 0 < lim inf_{k→∞} λ_k ≤ lim sup_{k→∞} λ_k < 2/α.

It is then easy to see that both points (a) and (b) hold (actually, (b) holds trivially); moreover, the sequence generated by GPA (1.4) converges weakly [22, 28] to a solution of (1.3) (if any).
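For intuition only, the following minimal Python sketch runs GPA (1.4) on a hypothetical instance: f_1(x) = ‖x − a‖², whose gradient is 2-Lipschitz, with C_1 the closed unit ball, whose projection has the closed form x/max(1, ‖x‖). All names (`proj_unit_ball`, `gpa`, the data `a`) are ours, not from the paper.

```python
import math

def proj_unit_ball(x):
    # Closed-form projection onto the closed unit ball: x / max(1, ||x||).
    n = math.sqrt(sum(t * t for t in x))
    return [t / max(1.0, n) for t in x]

def gpa(a, lam=0.4, iters=200):
    # GPA (1.4) for f1(x) = ||x - a||^2 over C1 = unit ball.
    # grad f1(x) = 2(x - a) is alpha-Lipschitz with alpha = 2, so any
    # fixed stepsize 0 < lam < 2/alpha = 1 satisfies condition 2 above.
    x = [0.0] * len(a)
    for _ in range(iters):
        grad = [2.0 * (xi - ai) for xi, ai in zip(x, a)]
        x = proj_unit_ball([xi - lam * gi for xi, gi in zip(x, grad)])
    return x

a = [3.0, 4.0]   # unconstrained minimizer of f1, lies outside the ball
x = gpa(a)       # iterates approach the constrained minimizer a/||a||
```

For a = (3, 4) the constrained minimizer is a/‖a‖ = (0.6, 0.8), and the iteration reaches it to high accuracy because the update map is a contraction composed with a nonexpansive projection.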

Observe that the splitting of the objective function f into the sum of (simpler) component functions, and of the set of constraints into the intersection of (simpler) convex subsets, aims at providing more efficient algorithmic approaches to (1.1) by utilizing the simpler structures of the component functions f_j (for instance, the proximal mappings of f_j are computable [10]) and of the sets C_i (for instance, the projections P_{C_i} possess closed formulae). This means that when we study algorithms for the composite optimization problem (1.1), we should use individual component functions f_j and individual subsets C_i at each iteration, not the full sum f of the component functions, nor the full intersection C of the sets.

The purpose of this paper is to analyse the convergence of parallel and cyclic projection algorithms for solving the optimization problem (1.1), significantly expanding the results of De Pierro and Helou Neto in [11] who focussed on the sequential projections version of the method. We provide a unified analysis of all three methods in the finite-dimensional setting.

The projection algorithms studied in this paper start with an arbitrary point x_0 and produce the iterates x_k (k ≥ 1), alternating between subgradient and projection steps.

The generic form of our projection algorithm is as follows.

 (Projection algorithm)
  x_{k,0} = x_k,
  x_{k,j} = x_{k,j−1} − λ_k v_{k,j},  v_{k,j} ∈ ∂f_j(x_{k,j−1}),  j = 1, 2, …, N,
  x_{k+1} = V_{k+1}(x_{k,N}).

Here ∂f_j(x) denotes the Moreau–Rockafellar subdifferential of the convex function f_j at the point x, and V_{k+1} is the (modification of the) projection operator that distinguishes the three methods. Explicitly, for k ≥ 0 we have

 V_{k+1} :=
  P_{C_M} ⋯ P_{C_1}  for sequential projections;
  P_{C_{[k+1]}},  [k+1] = (k mod M) + 1,  for cyclic projections;
  ∑_{i=1}^M β_i P_{C_i},  β_i > 0 ∀i,  ∑_{i=1}^M β_i = 1,  for parallel projections.

The sequential projection algorithm was introduced by De Pierro and Helou Neto in [11]; in this case, the projection step is a full cycle of projections onto the sets whose intersection is the feasible region. Explicitly, we have

 (Sequential projections)
  x_{k,0} = x_k,
  x_{k,j} = x_{k,j−1} − λ_k v_{k,j},  v_{k,j} ∈ ∂f_j(x_{k,j−1}),  j = 1, 2, …, N,
  x_{k+1} = P_{C_M} ⋯ P_{C_1} x_{k,N}.

In the finite-dimensional case, De Pierro and Helou Neto discussed the convergence properties of the above algorithm (note that we have generalized the original method slightly, replacing gradients with subgradients; this does not affect the convergence analysis, which relies on the convexity of the component objective functions rather than their differentiability). Moreover, they raised several open questions regarding projection algorithms for solving (1.1), one of which is whether the sequential projections in their algorithm can be replaced with parallel projections. We answer this question in the affirmative, not only for the parallel, but also for the cyclic version of the algorithm.

Our main result is the following direct generalization of [11, Theorem 1].

###### Theorem 1.1.

Let M, N ≥ 1 be integers, suppose that the sets C_1, …, C_M are closed and convex, and let C := ⋂_{i=1}^M C_i ≠ ∅. Assume that the real-valued convex functions f_1, …, f_N are defined on some convex subsets D_1, …, D_N of the space such that x_{k,j−1} ∈ D_j for all k ≥ 0 and j = 1, …, N (for a choice of cyclic, sequential or parallel projection algorithm), and there exist constants L_1, …, L_N such that

 max_{v ∈ ∂f_j(x_{k,j−1})} ‖v‖ ≤ L_j,  j = 1, 2, …, N,  k ≥ 0.

Moreover, assume that the sequence (x_k) (obtained via the chosen method) is bounded and

 0 < λ_k → 0  and  ∑_{k=0}^∞ λ_k = ∞.

Then the sequence (f(x_k)) converges to the optimal value f^*, and every cluster point of (x_k) is an optimal solution of (1.1), given that the solution set S^* is nonempty.

Note that our assumptions are standard in the analysis of numerical methods, and can be replaced by more constructive or convenient conditions, with some loss of generality.

The proof of our main result (Theorem 1.1) relies on the key property of asymptotic feasibility (which ensures that the cluster points of the iterative sequence belong to the feasible set). We prove asymptotic feasibility for the methods of parallel and cyclic projections in Section 3, and present the complete proof of Theorem 1.1 in Section 4. Note that even though we follow the general framework of De Pierro and Helou Neto, our proofs of asymptotic feasibility for cyclic and parallel projections are based on entirely different ideas.

We begin our discussion by introducing some notation and other preliminary information and results in Section 2; after presenting the proofs of the main results in Sections 3 and 4, we provide a discussion of some generalizations, including the infinite-dimensional setting, and of some practical improvements and modifications of the methods.

## 2. Notation and Preliminaries

The fundamental tool of our argument in this paper is the concept of projections. Let H be a real Hilbert space with inner product ⟨·, ·⟩ and norm ‖·‖, respectively, and let C be a nonempty closed convex subset of H. The (nearest point) projection from H onto C, denoted by P_C, is defined by

 (2.1)  P_C x := argmin_{y ∈ C} ‖x − y‖,  x ∈ H.

The following well-known properties are pertinent to our argument in Section 3.

###### Proposition 2.1.

Let H be a real Hilbert space, and for any closed convex set C ⊆ H let P_C be the projection operator defined by (2.1). Then the following properties hold.

(i) ⟨x − P_C x, z − P_C x⟩ ≤ 0 for all x ∈ H and z ∈ C.

(ii) ⟨P_C x − P_C y, x − y⟩ ≥ ‖P_C x − P_C y‖² for all x, y ∈ H; in particular, P_C is nonexpansive, namely,

 ‖P_C x − P_C y‖ ≤ ‖x − y‖,  x, y ∈ H.

(iii) ‖x − z‖² ≥ ‖P_C x − z‖² + ‖P_C x − x‖² for all x ∈ H and z ∈ C.

We also define the distance function from a point x ∈ H to a set E ⊆ H as

 d_E(x) := inf{‖x − y‖ : y ∈ E}.

Observe that for a closed convex set C we have d_C(x) = ‖x − P_C x‖.
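For a few standard sets the projection (2.1), and hence the distance function, is available in closed form. The following Python helpers are our own illustration, with hypothetical names, covering the box and halfspace cases:

```python
import math

def proj_box(x, lo, hi):
    # Projection onto the box [lo_1, hi_1] x ... x [lo_n, hi_n]: clip coordinates.
    return [min(max(xi, l), h) for xi, l, h in zip(x, lo, hi)]

def proj_halfspace(x, a, b):
    # Projection onto the halfspace {y : <a, y> <= b}.
    s = sum(ai * xi for ai, xi in zip(a, x)) - b
    if s <= 0:
        return list(x)   # already feasible
    nrm2 = sum(ai * ai for ai in a)
    return [xi - (s / nrm2) * ai for xi, ai in zip(x, a)]

def dist(x, proj):
    # d_C(x) = ||x - P_C x|| for a closed convex set given by its projector.
    p = proj(x)
    return math.sqrt(sum((xi - pi) ** 2 for xi, pi in zip(x, p)))
```

For example, `proj_halfspace([2, 0], [1, 0], 1)` returns `[1, 0]`, and the corresponding distance is 1.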

As mentioned earlier, the CFP (1.2) can be solved by the projection onto convex sets method (POCS), whose convergence is well-understood in the general context of real Hilbert spaces. We recall the well-known convergence results of two major POCS algorithms [2, 9, 21, 27].

###### Theorem 2.2.

Beginning with an arbitrarily chosen initial guess x_0, we iterate in either one of the following two projection algorithms:

1. Sequential (cyclic) projections: x_{k+1} = P_{C_{[k+1]}} x_k, [k+1] = (k mod M) + 1;

2. Parallel projections: x_{k+1} = ∑_{i=1}^M β_i P_{C_i} x_k, with β_i > 0 for all i and ∑_{i=1}^M β_i = 1.

Then (x_k) converges weakly to a solution of CFP (1.2), given that this solution set is nonempty.
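As an illustration (ours, not from the paper), the following Python sketch runs sequential POCS with one full cycle of projections per iteration, x_{k+1} = P_{C_2} P_{C_1} x_k, for two hypothetical sets: C_1 the closed unit ball and C_2 the halfspace {x : x_1 ≥ 0.5}.

```python
import math

def proj_ball(x):
    # Projection onto C1, the closed unit ball.
    n = math.sqrt(sum(t * t for t in x))
    return [t / max(1.0, n) for t in x]

def proj_halfspace(x):
    # Projection onto C2 = {x : x_1 >= 0.5}.
    return [max(x[0], 0.5), x[1]]

# Sequential POCS over a full cycle: x_{k+1} = P_{C2} P_{C1} x_k.
x = [-2.0, 2.0]
for _ in range(50):
    x = proj_halfspace(proj_ball(x))
# x now lies (numerically) in C1 ∩ C2
```

On this toy instance the iteration in fact stabilizes after a couple of cycles at a point of the intersection; in general, only weak convergence is guaranteed.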

Another key notion in our discussion is that of a convex function and the Moreau–Rockafellar subdifferential [6]. Let D be a convex subset of H, and let f : D → ℝ be a convex function. A subgradient of f at x ∈ D is a vector v ∈ H such that

 f(y) ≥ f(x) + ⟨y − x, v⟩  ∀y ∈ D.

The set of all subgradients of f at x is called the subdifferential and is denoted by ∂f(x).
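As a quick sanity check (ours, not from the paper), the subgradient inequality can be verified numerically for f(x) = |x|, whose subdifferential at 0 is the interval [−1, 1]:

```python
def f(x):
    # f(x) = |x| is convex but not differentiable at 0.
    return abs(x)

def subgrad(x):
    # One valid selection v in the subdifferential of |x|: sign(x),
    # choosing v = 0 at x = 0 (any v in [-1, 1] would do there).
    return (x > 0) - (x < 0)

# Verify f(y) >= f(x) + v*(y - x) on a small grid of points.
ok = all(
    f(y) >= f(x) + subgrad(x) * (y - x) - 1e-12
    for x in [-1.0, -0.3, 0.0, 0.5, 2.0]
    for y in [-2.0, -0.7, 0.0, 0.4, 3.0]
)
```

Any measurable selection like `subgrad` is all the projection algorithms below require; differentiability plays no role.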

Let

 S^* := {x^* ∈ C : f(x^*) = inf_{x ∈ C} f(x)}  and  f^* := inf_{x ∈ C} f(x)

be the set of optimal solutions and the optimal value of the composite minimization problem (1.1), respectively. We shall always assume from now on that S^* ≠ ∅.

Two problems are pertinent:

(a) whether the sequence (x_k) would (weakly) converge to an optimal solution x^* ∈ S^*;

(b) whether the sequence (f(x_k)) would converge to the optimal value f^*.

If the answer to (a) is affirmative, then the answer to (b) is also positive.

The assumptions of Theorem 1.1 play a key role in establishing the aforementioned properties. We state and discuss these assumptions here explicitly for the clarity of exposition.

First, we make a standard assumption on the divergence of the series of diminishing stepsizes used at the gradient cycle of our projection algorithm: we require that

 (2.2)  0 < λ_k → 0  and  ∑_{k=0}^∞ λ_k = ∞.

The first condition ensures that the steps we make are indeed descent steps, and that the gradient step does not derail the convergence of the projection steps to the feasible set. The second condition ensures that there is no artificial restriction on how far the sequence of iterates can depart from the initial point.

The second key assumption is a uniform Lipschitz bound on the components of the objective function. Explicitly, we use the following assumption on the subgradients of our functions,

 (2.3)  max_{v ∈ ∂f_j(x_{k,j−1})} ‖v‖ ≤ L_j,  j = 1, 2, …, N,  k ≥ 0,

and we also let L := ∑_{j=1}^N L_j. Observe that this condition is satisfied naturally when these (real-valued) functions are defined on the whole finite-dimensional space and the sequence (x_k) is bounded. It is also well known (see [2, Proposition 7.8]) that the condition of a function having bounded gradients (subdifferentials) on bounded sets is equivalent to the function being bounded on bounded sets in the finite-dimensional setting.

## 3. Asymptotic Feasibility of Parallel and Cyclic Projections

We are ready to prove two major technical results that concern the asymptotic feasibility of parallel and cyclic projections (Lemmas 3.1 and 3.6 respectively). Note that the relevant statement for the sequential projections was shown in [11].

### 3.1. Asymptotic Feasibility for Parallel Projections

Recall that the parallel projection algorithm (PPA) utilizes a convex combination of the projections onto the sets C_1, …, C_M in its projection step:

 (PPA)
  x_{k,0} = x_k,
  x_{k,j} = x_{k,j−1} − λ_k v_{k,j},  v_{k,j} ∈ ∂f_j(x_{k,j−1}),  j = 1, 2, …, N,
  x_{k+1} = ∑_{i=1}^M β_i P_{C_i} x_{k,N},  β_i > 0 ∀i,  ∑_{i=1}^M β_i = 1.
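A minimal Python sketch of (PPA) on a hypothetical instance may help fix ideas: f_1(x) = x_1 and f_2(x) = x_2 over C_1 = {x : x_1 ≥ 1} and C_2 = {x : x_2 ≥ 1}, with β_1 = β_2 = 1/2 and λ_k = 1/(k+1). All names are ours, not from the paper.

```python
def ppa(x0, subgrads, projs, betas, lam, iters):
    # One outer iteration: an inner cycle of subgradient steps on
    # f_1, ..., f_N, then a convex combination of the M projections.
    x = list(x0)
    for k in range(iters):
        lk = lam(k)
        for g in subgrads:                 # x_{k,j} = x_{k,j-1} - lam_k v_{k,j}
            v = g(x)
            x = [xi - lk * vi for xi, vi in zip(x, v)]
        ps = [p(x) for p in projs]         # x_{k+1} = sum_i beta_i P_{C_i} x_{k,N}
        x = [sum(b * p[d] for b, p in zip(betas, ps)) for d in range(len(x))]
    return x

x = ppa(
    x0=[0.0, 0.0],
    subgrads=[lambda x: [1.0, 0.0], lambda x: [0.0, 1.0]],   # f1 = x_1, f2 = x_2
    projs=[lambda x: [max(x[0], 1.0), x[1]],                 # C1 = {x_1 >= 1}
           lambda x: [x[0], max(x[1], 1.0)]],                # C2 = {x_2 >= 1}
    betas=[0.5, 0.5],
    lam=lambda k: 1.0 / (k + 1),
    iters=2000,
)
```

The optimal solution here is (1, 1) with f^* = 2; the iterates are never exactly feasible (the convex combination of projections need not land in C), but they become asymptotically feasible and approach the optimum at a rate governed by λ_k.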

Our goal is to prove the following result. We begin with several technical claims used in the proof, which is deferred to the end of this subsection.

###### Lemma 3.1.

Assume C ≠ ∅, (2.3), and λ_k → 0, and that the sequence (x_k) generated by the method of parallel projections is bounded. Then (x_k) is asymptotically feasible, that is, lim_{k→∞} d_C(x_k) = 0.

The following technical result is used in the subsequent proofs.

###### Lemma 3.2.

Let (x_k) be a sequence generated by the parallel projection algorithm and assume that the Lipschitz condition (2.3) is satisfied. Then

(i) ‖x_{k,N} − x_k‖ ≤ Lλ_k, where L := ∑_{j=1}^N L_j.

(ii) ‖x_{k+1} − z‖² ≤ ‖x_{k,N} − z‖² − ∑_{j=1}^M β_j d²_{C_j}(x_{k,N}) for every z ∈ C.

(iii) d²_C(x_{k+1}) ≤ d²_C(x_{k,N}) − ∑_{j=1}^M β_j d²_{C_j}(x_{k,N}).

(iv) ∑_{j=1}^M β_j d²_{C_j}(x_k) ≤ d²_C(x_k) − d²_C(x_{k+1}) + O(λ_k).

###### Proof.

(i) We have

 ‖x_{k,N} − x_k‖ ≤ ∑_{j=1}^N ‖x_{k,j} − x_{k,j−1}‖ = ∑_{j=1}^N λ_k ‖v_{k,j}‖ ≤ ∑_{j=1}^N λ_k L_j = Lλ_k.

(ii) For , we have

 ‖x_{k+1} − z‖² = ‖∑_{j=1}^M β_j P_{C_j} x_{k,N} − z‖²
  ≤ ∑_{j=1}^M β_j ‖P_{C_j} x_{k,N} − z‖²  (by convexity of ‖·‖²)
  ≤ ∑_{j=1}^M β_j (‖x_{k,N} − z‖² − ‖x_{k,N} − P_{C_j} x_{k,N}‖²)  (by Proposition 2.1(iii))
  = ‖x_{k,N} − z‖² − ∑_{j=1}^M β_j d²_{C_j}(x_{k,N}).

(iii) This is a straightforward consequence of (ii), obtained by taking z = P_C(x_{k,N}).

(iv) This is easily derived from (iii), (i) and the fact that the distance function to a convex set is Lipschitz continuous with Lipschitz constant one:

 |d_K(x) − d_K(y)| ≤ ‖x − y‖. ∎

###### Lemma 3.3.

Assume C ≠ ∅, λ_k → 0, the condition (2.3) is satisfied, and (x_k) is bounded. Then for any ε > 0 there exists δ > 0 such that

 (3.1)  d²_C(x_{k+1}) ≤ d²_C(x_k) − δ

whenever k is such that d_C(x_k) ≥ ε. Consequently, lim inf_{k→∞} d_C(x_k) = 0.

###### Proof.

Suppose not; then for some ε > 0 we have a subsequence (x_{k_l}) of (x_k) such that d_C(x_{k_l}) ≥ ε and

 (3.2)  d²_C(x_{k_l+1}) > d²_C(x_{k_l}) − 1/l

for all l ≥ 1. It then follows from Lemma 3.2(iv) that

 (3.3)  ∑_{j=1}^M β_j d²_{C_j}(x_{k_l}) ≤ d²_C(x_{k_l}) − d²_C(x_{k_l+1}) + O(λ_{k_l}) < 1/l + O(λ_{k_l}) → 0  (as l → ∞).

Since (x_{k_l}) is a bounded sequence in a finite-dimensional space, we may assume that x_{k_l} → x̂. We then get

 (3.4)  ∑_{j=1}^M β_j d²_{C_j}(x̂) = 0.

This implies that x̂ ∈ C_j for every j; hence, x̂ ∈ C. This contradicts the fact that d_C(x̂) = lim_{l→∞} d_C(x_{k_l}) ≥ ε. ∎

We are now ready to prove Lemma 3.1.

###### Proof of Lemma 3.1.

By Lemma 3.3 we have lim inf_{k→∞} d_C(x_k) = 0; hence, given ε > 0, we can take k_0 such that d_C(x_{k_0}) < ε and λ_k L < ε/2 for all k ≥ k_0. Let k ≥ k_0. Consider two cases.

Case 1: d_C(x_k) < ε. In this case, we have by Lemma 3.2(iii)

 (3.5)  d_C(x_{k+1}) ≤ d_C(x_{k,N}) ≤ d_C(x_k) + ‖x_k − x_{k,N}‖ ≤ d_C(x_k) + λ_k L < (3/2)ε.

Case 2: d_C(x_k) ≥ ε. Using Lemma 3.3, we obtain d_C(x_{k+1}) ≤ d_C(x_k).

We now prove, for all i ≥ 0,

 (3.6)  d_C(x_{k_0+i}) < (3/2)ε.

Indeed, (3.6) is trivial when i = 0. Assume (3.6) holds for some i ≥ 0. If d_C(x_{k_0+i}) ≥ ε, then, by Case 2, d_C(x_{k_0+i+1}) ≤ d_C(x_{k_0+i}) < (3/2)ε; if d_C(x_{k_0+i}) < ε, then, by Case 1, we get d_C(x_{k_0+i+1}) < (3/2)ε. Hence, (3.6) also holds for i + 1.

Since ε > 0 was arbitrary, it follows from (3.6) that lim_{k→∞} d_C(x_k) = 0, and Lemma 3.1 is proven. ∎

Note that Lemmas 3.3 and 3.1 can be generalized for the infinite-dimensional setting. We discuss this in more detail in Section 5.1.

###### Remark 3.4.

We include a version of Lemma 3.2 for the sequential projection algorithm (SPA), which generates a sequence (x_k) via the following iteration process:

 (SPA)
  x_{k,0} = x_k,
  x_{k,j} = x_{k,j−1} − λ_k v_{k,j},  v_{k,j} ∈ ∂f_j(x_{k,j−1}),  j = 1, 2, …, N,
  x_{k+1} = P_{C_M} ⋯ P_{C_1} x_{k,N}.
###### Lemma 3.5.

Let (x_k) be generated by (SPA) and assume that the Lipschitz condition (2.3) is satisfied. Then

(i) ‖x_{k,N} − x_k‖ ≤ Lλ_k, where L := ∑_{j=1}^N L_j.

(ii) ‖x_{k+1} − z‖² ≤ ‖x_{k,N} − z‖² − ∑_{i=1}^M ‖Q_i x_{k,N} − Q_{i−1} x_{k,N}‖² for every z ∈ C.

(iii) d²_C(x_{k+1}) ≤ d²_C(x_{k,N}) − ∑_{i=1}^M ‖Q_i x_{k,N} − Q_{i−1} x_{k,N}‖².

(iv) ∑_{i=1}^M ‖Q_i x_k − Q_{i−1} x_k‖² ≤ d²_C(x_k) − d²_C(x_{k+1}) + O(λ_k).

Here Q_i := P_{C_i} P_{C_{i−1}} ⋯ P_{C_1} for i = 1, …, M, and we use the convention Q_0 := I. Note that x_{k+1} = Q_M x_{k,N}.

The proof of Lemma 3.5 follows the same lines as the proof of Lemma 3.2. For instance, part (ii) can be proved by consecutively applying property (iii) of projections in Proposition 2.1 (it is also proved in [11]). Part (iv) can trivially be derived from (iii) by using the Lipschitz-1 property of distance functions.

By Lemma 3.5, we find that the conclusion of Lemma 3.3 holds true also for the SPA.

### 3.2. Asymptotic Feasibility for Cyclic Projections

Recall that the cyclic projection algorithm (CPA) alternates the full cycle of subgradient steps with a single projection onto one of the sets C_1, …, C_M, as follows.

 (CPA)
  x_{k,0} = x_k,
  x_{k,j} = x_{k,j−1} − λ_k v_{k,j},  v_{k,j} ∈ ∂f_j(x_{k,j−1}),  j = 1, 2, …, N,
  x_{k+1} = P_{C_{[k+1]}} x_{k,N},  [k+1] = (k mod M) + 1.
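A minimal Python sketch of (CPA), on the hypothetical instance f_1(x) = x_1 and f_2(x) = x_2 over C_1 = {x : x_1 ≥ 1} and C_2 = {x : x_2 ≥ 1} with λ_k = 1/(k+1), is the following; all names are ours, not from the paper.

```python
def cpa(x0, subgrads, projs, lam, iters):
    # An inner cycle of subgradient steps on f_1, ..., f_N, followed by
    # a single projection onto C_{[k+1]}, [k+1] = (k mod M) + 1.
    x = list(x0)
    M = len(projs)
    for k in range(iters):
        lk = lam(k)
        for g in subgrads:
            v = g(x)
            x = [xi - lk * vi for xi, vi in zip(x, v)]
        x = projs[k % M](x)
    return x

x = cpa(
    x0=[0.0, 0.0],
    subgrads=[lambda x: [1.0, 0.0], lambda x: [0.0, 1.0]],   # f1 = x_1, f2 = x_2
    projs=[lambda x: [max(x[0], 1.0), x[1]],                 # C1 = {x_1 >= 1}
           lambda x: [x[0], max(x[1], 1.0)]],                # C2 = {x_2 >= 1}
    lam=lambda k: 1.0 / (k + 1),
    iters=2000,
)
```

Each iterate lies in only one of the two sets, yet the sequence approaches the optimal solution (1, 1); the residual infeasibility is again of order λ_k.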

Our goal is to prove the following asymptotic feasibility result that mirrors Lemma 3.1.

###### Lemma 3.6.

Assume C ≠ ∅, (2.3), and λ_k → 0, and that the sequence (x_k) generated by the method of cyclic projections is bounded. Then (x_k) is asymptotically feasible, that is, lim_{k→∞} d_C(x_k) = 0.

To prove this lemma, we need several technical claims. First, for any x and any q ∈ {1, 2, …, M} define the exact q-cyclic projection

 (3.7)  P_q(x) := P_{C_q} P_{C_{q−1}} ⋯ P_{C_1} P_{C_M} P_{C_{M−1}} ⋯ P_{C_{q+1}}(x).
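In code, P_q is just a composition of the individual projectors taken in cyclic order starting after index q. A small Python sketch (ours, with hypothetical halfspace projectors) also illustrates the effect formalized in the next proposition: the image of an infeasible point lands strictly closer to C.

```python
def make_Pq(projs, q):
    # Exact q-cyclic projection (3.7): apply P_{C_{q+1}}, ..., P_{C_M},
    # then P_{C_1}, ..., P_{C_q} (q is 1-based), ending on C_q.
    M = len(projs)
    order = list(range(q, M)) + list(range(q))  # 0-based indices
    def Pq(x):
        for i in order:
            x = projs[i](x)
        return x
    return Pq

# Hypothetical halfspace projectors: C1 = {x_1 >= 1}, C2 = {x_2 >= 1}.
proj1 = lambda x: [max(x[0], 1.0), x[1]]
proj2 = lambda x: [x[0], max(x[1], 1.0)]
P1 = make_Pq([proj1, proj2], 1)   # P_1 = P_{C_1} P_{C_2}
y = P1([0.0, 0.0])                # here the image even lands in C = C1 ∩ C2
```

In general P_q(x) need not be feasible; it is only guaranteed to be no farther from C than x, strictly closer when x ∉ C.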

We next show that such cyclic projections bring the iterates closer to the feasible set in a uniform sense.

###### Proposition 3.7.

Let X be a nonempty compact convex subset of the space such that C ∩ X ≠ ∅ and X ⊄ C. For each q ∈ {1, 2, …, M} define a function ψ_X^q : [0, +∞) → [0, +∞),

 (3.8)  ψ_X^q(α) := sup{d(P_q(x), C) : x ∈ X, d(x, C) ≤ α}.

The function ψ_X^q is continuous and ψ_X^q(α) < α for all α > 0.

###### Proof.

We assume throughout that the compact convex set X and the index q are fixed, and use the notation ψ := ψ_X^q. We first show that d(P_q(x), C) < d(x, C) for every x ∉ C. For any closed convex set S we have by Proposition 2.1(iii)

 ‖x − y‖² ≥ ‖P_S(x) − y‖² + ‖P_S(x) − x‖²,

hence, in our setting, for any y ∈ C,

 ‖x − y‖² ≥ ‖P_{C_{q+1}}(x) − y‖² + ‖P_{C_{q+1}}(x) − x‖²
  ≥ ‖P_{C_{q+2}} P_{C_{q+1}}(x) − y‖² + ‖P_{C_{q+2}} P_{C_{q+1}}(x) − P_{C_{q+1}}(x)‖² + ‖P_{C_{q+1}}(x) − x‖²
  ≥ ⋯
  ≥ ‖P_q(x) − y‖² + ‖P_q(x) − P_{C_{q−1}} ⋯ P_{C_{q+1}}(x)‖² + ⋯ + ‖P_{C_{q+1}}(x) − x‖².

It is evident then that if x ∉ C, we have

 ‖x − y‖² ≥ ‖P_q(x) − y‖² + γ(x)  ∀y ∈ C,

where γ(x) > 0 does not depend on y. Therefore, taking the infimum over y ∈ C, we have for every x ∉ C

 d²(x, C) = inf_{y∈C} ‖x − y‖²
  ≥ inf_{y∈C} ‖P_q(x) − y‖² + γ(x)
 (3.9)  = d²(P_q(x), C) + γ(x),

and so

 (3.10)  d(x, C) > d(P_q(x), C)  for every x ∉ C.

Now let

 X_α := X ∩ {x | d(x, C) ≤ α}.

Observe that explicitly

 (3.11)  ψ(α) = sup_{x ∈ X_α} d(P_q(x), C).

The set X_α is compact because it is the intersection of the compact set X with the closed set {x | d(x, C) ≤ α}, and it is nonempty for every α ≥ 0 because C ∩ X ≠ ∅. The function x ↦ d(P_q(x), C) is continuous, and since each of the sets X_α is compact and nonempty, the supremum in (3.11) is attained, and we have

 (3.12)  ψ(α) = max_{x ∈ X_α} d(P_q(x), C)  ∀α ≥ 0.

Hence, for every α ≥ 0 there exists x_α ∈ X_α such that d(x_α, C) ≤ α and

 ψ(α) = d(P_q(x_α), C).

If x_α ∈ C, then ψ(α) = d(P_q(x_α), C) = 0 < α. If x_α ∉ C, we have d(x_α, C) ≤ α and from (3.10)

 ψ(α) = d(P_q(x_α), C) < d(x_α, C) ≤ α.

We next focus on showing that ψ is continuous. Since

 X_α ⊆ X_β  for 0 ≤ α ≤ β,

the function ψ is nondecreasing, and to prove its continuity it is sufficient to show that

 (3.13)  lim inf_{α↑ᾱ} ψ(α) ≥ ψ(ᾱ)  ∀ᾱ > 0,  and  lim sup_{α↓ᾱ} ψ(α) ≤ ψ(ᾱ)  ∀ᾱ ≥ 0.

If ψ(α_0) ≥ ψ(ᾱ) for some α_0 < ᾱ, since ψ is nondecreasing, we have ψ(α) ≥ ψ(ᾱ) for all α ∈ [α_0, ᾱ), and the first relation in (3.13) holds trivially. Consider the case ψ(α) < ψ(ᾱ) for all α < ᾱ. From (3.12) we know that there exists x̄ ∈ X_ᾱ such that d(x̄, C) ≤ ᾱ and ψ(ᾱ) = d(P_q(x̄), C). Let x_0 ∈ C ∩ X (so that d(x_0, C) = 0). Since X is convex, we have [x_0, x̄] ⊆ X. Let

 t_0 := sup{t ∈ [0, 1] | d(x_0 + t(x̄ − x_0), C) = 0}.

Since by our assumption ψ(ᾱ) > 0, we have x̄ ∉ C and hence t_0 < 1. Now take any t_1, t_2 with t_0 ≤ t_1 < t_2 ≤ 1. We have

 d(x_0 + t_1(x̄ − x_0), C) ≤ ‖[x_0 + t_1(x̄ − x_0)] − [x_0 + (t_1/t_2)(P_C(x_0 + t_2(x̄ − x_0)) − x_0)]‖
  = (t_1/t_2) ‖x_0 + t_2(x̄ − x_0) − P_C(x_0 + t_2(x̄ − x_0))‖
  = (t_1/t_2) d(x_0 + t_2(x̄ − x_0), C),

and hence d(x_0 + t(x̄ − x_0), C) is strictly increasing in t for t ∈ (t_0, 1]. From this, together with the continuity of the distance function, we deduce that for every t ∈ (t_0, 1) there exists α_t < ᾱ such that

 d(x_0 + t′(x̄ − x_0), C) ≤ α_t < ᾱ  ∀t′ ∈ [0, t].

At the same time, by the continuity of x ↦ d(P_q(x), C), for every ε > 0 there exists t ∈ (t_0, 1) sufficiently close to 1 such that

 d(P_q(x_0 + t(x̄ − x_0)), C) ≥ d(P_q(x̄), C) − ε.

This means that for every ε > 0 we can find t ∈ (t_0, 1) and α_t < ᾱ such that

 ψ(α) ≥ ψ(α_t) ≥ d(P_q(x̄), C) − ε = ψ(ᾱ) − ε  ∀α ≥ α_t,

and therefore we have the desired

 lim inf_{α↑ᾱ} ψ(α) ≥ ψ(ᾱ).

It remains to show the second relation in (3.13). Let (α_k) be such that α_k ↓ ᾱ as k → ∞, and

 lim_{k→∞} ψ(α_k) = lim sup_{α↓ᾱ} ψ(α).

From (3.12) there exists a sequence (x_k) ⊆ X such that

 d(x_k, C) ≤ α_k,  ψ(α_k) = d(P_q(x_k), C).

Without loss of generality this sequence converges to some x̄ ∈ X. By continuity we have

 d(P_q(x_k), C) → d(P_q(x̄), C);  d(x̄, C) = lim_{k→∞} d(x_k, C) ≤ ᾱ.

Therefore

 lim_{k→∞} ψ(α_k) = lim_{k→∞} d(P_q(x_k), C) = d(P_q(x̄), C) ≤ ψ(ᾱ). ∎

###### Proposition 3.8.

Let (x_k) be a bounded sequence obtained by means of the cyclic projection algorithm, under assumption (2.3), and let λ_k → 0. Then for any q ∈ {1, …, M} and any ε > 0 there exists a sufficiently large K such that

 ‖P_q(x_k) − x_{k+M}‖ ≤ ε  for all k ≥ K with (k mod M) + 1 = q,

where P_q is the exact q-cyclic projection operator defined by (3.7).

###### Proof.

Using the nonexpansivity of the projection operator (Proposition 2.1 (ii)) we have

 ‖P_q(x_k) − x_{k+M}‖ = ‖P_{C_q} P_{C_{q−1}} ⋯ P_{C_{q+1}}(x_k) − P_{C_q}(x_{k+M−1,N})‖
  ≤ ‖P_{C_{q−1}} ⋯ P_{C_{q+1}}(x_k) − x_{k+M−1,N}‖
  ≤ ‖P_{C_{q−1}} ⋯ P_{C_{q+1}}(x_k) − x_{k+M−1}‖ + ‖x_{k+M−1} − x_{k+M−1,N}‖
  = ‖P_{C_{q−1}} ⋯ P_{C_{q+1}}(x_k) − P_{C_{q−1}}(x_{k+M−2,N})‖ + ‖x_{k+M−1} − x_{k+M−1,N}‖
  ≤ ⋯
  ≤ ∑_{i=1}^M ‖x_{k+M−i} − x_{k+M−i,N}‖
  ≤ ∑_{i=1}^M ∑_{j=1}^N ‖x_{k+M−i,j−1} − x_{k+M−i,j}‖
  = ∑_{i=1}^M λ_{k+M−i} ∑_{j=1}^N ‖v_{k+M−i,j}‖
  ≤ ∑_{i=1}^M λ_{k+M−i} ∑_{j=1}^N L_j
  = L ∑_{i=1}^M λ_{k+M−i},

where L := ∑_{j=1}^N L_j. Since λ_k → 0, we can always find a sufficiently large number K to ensure that the last term is smaller than ε for all k ≥ K. ∎

The next proposition brings us closer to the proof of Lemma 3.6.

###### Proposition 3.9.

Assume that λ_k → 0, the sequence (x_k) generated by the cyclic projection algorithm is bounded, and condition (2.3) is satisfied. Then

 lim inf_{k→∞} d(x_k, C) = 0.
###### Proof.

Assume that the claim is not true. Then for some starting point x_0 the sequence (x_k) is bounded, but

 lim inf_{k→∞} d(x_k, C) = D > 0.

Let (x_{k_l}) be a subsequence of (x_k) such that

 lim_{l→∞} d(x_{k_l}, C) = lim inf_{k→∞} d(x_k, C) = D.

Without loss of generality we may assume that x_{k_l} → x̄ and that the indices k_l are congruent modulo M, so that each x_{k_l} is obtained after projecting onto the same set C_q.

Since the sequence (x_k) is bounded, we can define the function ψ = ψ_X^q (as in Proposition 3.7) on any compact convex set X that contains (x_k) and some point from C, which we assumed to be nonempty. By the continuity of ψ proved in Proposition 3.7 we have

 limkl→∞d(Pq(xkl),C