# Nonconvex Stochastic Nested Optimization via Stochastic ADMM

Zhongruo Wang
###### Abstract

We consider the stochastic nested composition optimization problem where the objective is a composition of two expected-value functions. We propose a stochastic ADMM to solve this composite objective. To find an $\epsilon$-stationary point, at which the expected norm of the subgradient of the corresponding augmented Lagrangian is smaller than $\epsilon$, we establish the total sample complexity of our method for both the online case and the finite-sum case. The computational complexity is consistent with the proximal version proposed in [Zhang and Xiao, 2019], but our algorithm can solve more general problems in which the proximal mapping of the penalty is not easy to compute.

## 1 Introduction

Consider the following optimization problem:

$$\min_{x\in\mathbb{R}^d,\,y\in\mathbb{R}^l} F(x)+\sum_{j=1}^m r_j(y_j)=\mathbb{E}_{\xi_2}f_{2,\xi_2}\big(\mathbb{E}_{\xi_1}f_{1,\xi_1}(x)\big)+\sum_{j=1}^m r_j(y_j)\quad\text{s.t.}\quad Ax+\sum_{j=1}^m B_j y_j=c \tag{1}$$

An interesting special case is when $\xi_1$ and $\xi_2$ follow uniform distributions over finite supports:

$$\min_{x\in\mathbb{R}^d,\,y\in\mathbb{R}^l} F(x)+\sum_{j=1}^m r_j(y_j)=\frac{1}{N_2}\sum_{i=1}^{N_2} f_{2,i}\Big(\frac{1}{N_1}\sum_{j=1}^{N_1} f_{1,j}(x)\Big)+\sum_{j=1}^m r_j(y_j)\quad\text{s.t.}\quad Ax+\sum_{j=1}^m B_j y_j=c \tag{2}$$
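To make the finite-sum formulation (2) concrete, the following toy sketch instantiates a hypothetical instance; all data and function choices below (linear inner maps, quadratic outer maps, an $\ell_1$ penalty) are illustrative, not from the paper:

```python
# Toy instance of the finite-sum problem (2); all data below are synthetic.
# Inner maps: f_{1,j}(x) = W_j x; outer maps: f_{2,i}(u) = (q_i^T u)^2 / 2;
# penalty r(y) = ||y||_1 with the linear constraint A x - y = 0.
import numpy as np

rng = np.random.default_rng(0)
d, p, N1, N2 = 5, 3, 10, 8
W = rng.normal(size=(N1, p, d))   # realizations of the inner vector mapping
q = rng.normal(size=(N2, p))      # realizations of the outer scalar function
A = rng.normal(size=(4, d))       # coupling matrix in the linear constraint

def F(x):
    u = W.mean(axis=0) @ x                 # inner average (1/N1) sum_j f_{1,j}(x)
    return 0.5 * np.mean((q @ u) ** 2)     # outer average (1/N2) sum_i f_{2,i}(u)

x = rng.normal(size=d)
y = A @ x                                  # feasible y for the constraint A x - y = 0
val = F(x) + np.abs(y).sum()               # composite objective value
print(val)
```

Note that $F$ itself is a nested average: neither its value nor its gradient can be obtained from a single sample of the inner or outer sums, which is what motivates the estimators developed below.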

## 2 Motivation and Previous Works

When the penalty is not as simple as the $\ell_1$ penalty, for example the graph-guided lasso and the fused lasso, simple proximal algorithms cannot be applied. In such cases, performing operator splitting and using ADMM is a suitable approach. ADMM for the general convex and strongly convex cases has been studied in [Yu and Huang, 2017]; in their formulation, they assume a special structure on the penalty, which is not general enough for most ADMM problems. Using ADMM to solve the nonconvex composite nested objective has not been well studied, although various variance-reduced stochastic proximal methods have been analyzed in both the convex and nonconvex cases. Proximal versions of these algorithms have also been studied for formulations with multiple levels of composite functions [Zhang and Xiao, 2019, Lin et al., 2018], where different iteration complexities and stochastic oracles have been analyzed.

## 3 Contribution

In this work we present a stochastic variance-reduced ADMM algorithm to solve two-level and multi-level composite stochastic problems in both the finite-sum and online settings. To reach an $\epsilon$-stationary point of the augmented Lagrangian for a given threshold $\epsilon$, the simple mini-batch estimator yields an iteration and total sampling complexity that is too costly; when using a stochastic path-integrated estimator such as SARAH/SPIDER, the total sampling complexity improves in both the online and the finite-sum case.

## 4 Assumptions and Notations

The following assumptions are made for the further analysis of the algorithms:

1. The objective $F$ is bounded from below.

2. $A$ and each $B_j$ have full column rank or full row rank.

3. $F$ is $L_F$-smooth.

4. $f_1$ and $f_2$ are two smooth vector mappings; each realization $f_{i,\xi_i}$ of the random mapping is $\ell_i$-Lipschitz continuous and its Jacobian $f'_{i,\xi_i}$ is $L_i$-Lipschitz continuous.

5. $\mathbb{E}\|f_{1,\xi_1}(x)-f_1(x)\|_2^2\le\delta^2$ for all $x\in\mathbb{R}^d$.

6. $\mathbb{E}\|f'_{i,\xi_i}(x)-f'_i(x)\|_2^2\le\sigma_i^2$ for all $x\in\mathbb{R}^d$ and $i=1,2$.

7. Each $r_j$ is a convex regularizer, such as the $\ell_1$ norm.

• $\sigma_{\min}^A$ and $\sigma_{\max}^A$ denote the smallest and largest eigenvalues of the matrix $A^TA$, and $\sigma_{\min}(B_j)$ and $\sigma_{\max}(B_j)$ denote the smallest and largest eigenvalues of $B_j^TB_j$ for all $j$.

###### Definition 4.1.

For any $\epsilon>0$, the point $(x^*,y^*,\lambda^*)$ is said to be an $\epsilon$-stationary point of the nonconvex problem (1) if it holds that:

$$\begin{cases}\mathbb{E}\|Ax^*+By^*-c\|_2^2\le\epsilon^2\\[2pt]\mathbb{E}\|\nabla F(x^*)-A^T\lambda^*\|_2^2\le\epsilon^2\\[2pt]\mathbb{E}\big[\operatorname{dist}\big(B^T\lambda^*,\partial r(y^*)\big)^2\big]\le\epsilon^2\end{cases} \tag{3}$$

where $\operatorname{dist}(\cdot,\cdot)$ denotes the distance function and $\partial r(y^*)$ denotes the subgradient of $r$ at $y^*$. If $\epsilon=0$, the point is said to be a stationary point.

The above inequalities (3) are equivalent to $\mathbb{E}\|\partial L(x^*,y^*,\lambda^*)\|_2^2\le\epsilon^2$, where:

$$\partial L(x,y,\lambda)=\begin{bmatrix}\partial L(x,y,\lambda)/\partial x\\ \partial L(x,y,\lambda)/\partial y\\ \partial L(x,y,\lambda)/\partial\lambda\end{bmatrix} \tag{4}$$

and $L(x,y,\lambda)$ is the Lagrangian function of problem (1).

## 5 Main Result

As with any stochastic algorithm, the goal is to estimate the gradient as accurately as possible. The gradient of $F$ can be derived from the chain rule, which gives:

$$F'(x)=\big(\mathbb{E}_{\xi_1}[f'_{1,\xi_1}(x)]\big)^T\,\mathbb{E}_{\xi_2}\big[f'_{2,\xi_2}\big(\mathbb{E}_{\xi_1}f_{1,\xi_1}(x)\big)\big] \tag{5}$$
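As an illustration of (5), the following sketch estimates the nested gradient by replacing each expectation with a mini-batch average; the linear inner maps and quadratic outer maps (and all batch sizes) are hypothetical choices, not the paper's setup:

```python
# Mini-batch estimate of the nested gradient F'(x) = (E f_1'(x))^T f_2'(E f_1(x)),
# for toy realizations f_{1,j}(x) = W_j x and f_{2,i}(u) = (q_i^T u)^2 / 2.
import numpy as np

rng = np.random.default_rng(1)
d, p, N1, N2 = 5, 3, 200, 200
W = rng.normal(size=(N1, p, d))          # inner Jacobians are the W_j themselves
q = rng.normal(size=(N2, p))             # outer gradient: f_{2,i}'(u) = (q_i^T u) q_i

def nested_grad_estimate(x, s, b1, b2):
    """Mini-batch estimate v of F'(x), using three independent batches."""
    S  = rng.choice(N1, size=s)          # batch for the inner function value  Y_1
    B1 = rng.choice(N1, size=b1)         # batch for the inner Jacobian        Z_1
    B2 = rng.choice(N2, size=b2)         # batch for the outer gradient        Z_2
    Y1 = (W[S] @ x).mean(axis=0)         # Y_1 ~ f_1(x)
    Z1 = W[B1].mean(axis=0)              # Z_1 ~ f_1'(x), a (p x d) Jacobian
    Z2 = (q[B2] * (q[B2] @ Y1)[:, None]).mean(axis=0)   # Z_2 ~ f_2'(Y_1)
    return Z1.T @ Z2                     # v = Z_1^T Z_2

x = rng.normal(size=d)
v = nested_grad_estimate(x, s=50, b1=50, b2=50)
print(v.shape)                           # a d-dimensional gradient estimate
```

Because $Z_2$ is evaluated at the noisy point $Y_1$ rather than at $f_1(x)$, the estimator is biased; the analysis below controls its error rather than its bias.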

We use the following abbreviations to denote the approximations:

$$Y_1^k\approx f_1(x^k),\qquad Z_1^k\approx f_1'(x^k),\qquad Z_2^k\approx f_2'(Y_1^k)$$

Then the overall estimator for the gradient is $v^k=(Z_1^k)^TZ_2^k$. To solve the problem using stochastic ADMM, we first give the augmented Lagrangian function of the problem:

$$L_\rho(x,y_{[m]},z)=F(x)+\sum_{j=1}^m g_j(y_j)-\Big\langle z,\,Ax+\sum_{j=1}^m B_jy_j-c\Big\rangle+\frac{\rho}{2}\Big\|Ax+\sum_{j=1}^m B_jy_j-c\Big\|_2^2 \tag{6}$$

Since only a stochastic estimate of the gradient of $F$ is available for updating $x$, we use the approximate Lagrangian over $x$ with the estimated gradient $v^k$:

$$\begin{aligned}\hat L_\rho(x,y_{[m]},z^k,v^k)={}&F(x^k)+v_k^T(x-x^k)+\frac{1}{2\eta}\|x-x^k\|_G^2\\&+\sum_{j=1}^m g_j(y_j^{k+1})-\Big\langle z^k,\,Ax+\sum_{j=1}^m B_jy_j^{k+1}-c\Big\rangle+\frac{\rho}{2}\Big\|Ax+\sum_{j=1}^m B_jy_j^{k+1}-c\Big\|_2^2\end{aligned} \tag{7}$$

To avoid computing the inverse of $A^TA$, we can choose the matrix $G$ so as to linearize the quadratic term $\frac{\rho}{2}\|Ax+\sum_{j=1}^m B_jy_j^{k+1}-c\|_2^2$. Similarly, to make the proximal operator of each $g_j$ easy to compute, we can choose matrices $H_j$ for all $j\in[m]$ to linearize the corresponding quadratic term in the $y_j$-update. The question that remains is how to find a suitable gradient estimator for the composite function.
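A minimal sketch of one linearized stochastic ADMM iteration under these choices, assuming a single block ($m=1$), $g(y)=\|y\|_1$ with its soft-thresholding proximal map, and a generic gradient estimator $v^k$; all parameter values and the toy objective are illustrative:

```python
# One linearized stochastic ADMM iteration for min F(x) + ||y||_1  s.t.  Ax + By = c.
import numpy as np

def soft_threshold(u, t):
    """Proximal map of t * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def admm_step(x, y, z, v, A, B, c, rho, eta):
    # x-update: gradient step on the estimator v plus the linearized
    # augmented term, avoiding any inverse of A^T A.
    grad_aug = rho * A.T @ (A @ x + B @ y - c) - A.T @ z
    x_new = x - eta * (v + grad_aug)
    # y-update: linearize the quadratic around y, then apply the l1 prox.
    grad_y = rho * B.T @ (A @ x_new + B @ y - c) - B.T @ z
    tau = 1.0 / (rho * np.linalg.norm(B, 2) ** 2)
    y_new = soft_threshold(y - tau * grad_y, tau)
    # z-update: dual step on the constraint residual.
    z_new = z - rho * (A @ x_new + B @ y_new - c)
    return x_new, y_new, z_new

rng = np.random.default_rng(2)
d = 4
A, B, c = rng.normal(size=(3, d)), np.eye(3), np.zeros(3)
x, y, z = rng.normal(size=d), np.zeros(3), np.zeros(3)
for _ in range(200):
    v = x                              # stand-in estimator: exact gradient of ||x||^2/2
    x, y, z = admm_step(x, y, z, v, A, B, c, rho=1.0, eta=0.05)
print(np.linalg.norm(A @ x + B @ y - c))   # constraint residual
```

With a noisy $v^k$ in place of the exact gradient, the same three updates apply; the analysis below quantifies how the estimator's variance enters the convergence bound.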

In the following sections, we first consider the mini-batch estimator of the gradient and show that ADMM still converges under this simple implementation with a suitable choice of parameters. After that, we consider using SARAH/SPIDER to estimate the nested gradient. By comparing the sampling complexities, we show that the SARAH/SPIDER-based algorithm is more efficient than the traditional mini-batch-based algorithm.

## 6 Simple Mini-Batch Estimator

When facing a stochastic composite objective, one simple and straightforward strategy is to estimate the composite gradient using mini-batches. Since we are computing a composite gradient, we apply the mini-batch strategy to both the gradient and the function value at each level. This leads to the following algorithm.

From the algorithm we can see that even though $v^k$ is a biased estimator of the full gradient, we can still analyze the variance of the approximation and make it small. First, for each iteration, from [Zhang and Xiao, 2019] we know that:

$$\|v^k-F'(x^k)\|_2^2\le 3\|Z_1^k\|_2^2\big(\|Z_2^k-f_2'(Y_1^k)\|_2^2+L_2^2\|Y_1^k-f_1(x^k)\|_2^2\big)+3\ell_2^2\|Z_1^k-f_1'(x^k)\|_2^2$$

By using the mini-batch estimator, the variance of each component satisfies:

$$\mathbb{E}\|Z_2^k-f_2'(Y_1^k)\|_2^2\le\frac{\sigma_2^2}{b_2},\qquad \mathbb{E}\|Y_1^k-f_1(x^k)\|_2^2\le\frac{\delta^2}{s},\qquad \mathbb{E}\|Z_1^k-f_1'(x^k)\|_2^2\le\frac{\sigma_1^2}{b_1}$$

Also, we can have:

$$\|Z_1^k\|=\Big\|f'_{1,\xi_1}(x^k)+\frac{1}{b}\sum_{\xi\in B_1^r}\big(f'_{1,\xi}(x^r)-f'_{1,\xi}(x^{r-1})\big)\Big\|\le\|f'_{1,\xi_1}(x^k)\|+\|f'_{1,\xi}(x^r)\|+\|f'_{1,\xi}(x^{r-1})\|\le 3\ell_1 \tag{8}$$

Now, combining the bounds above with $\|Z_1^k\|\le 3\ell_1$ and conditioning on the batches, the variance bound on the estimated gradient is:

$$\mathbb{E}\|v^k-F'(x^k)\|_2^2\le\frac{27\ell_1^2\sigma_2^2}{b_2}+\frac{27\ell_1^2L_2^2\delta^2}{s}+\frac{3\ell_2^2\sigma_1^2}{b_1}\triangleq C \tag{9}$$

Now, we are ready to analyze the convergence of our proposed ADMM based on this estimator.

###### Lemma 6.1 (Bound on the dual variable).

Given the sequence generated by Algorithm 1, the dual variable update satisfies:

$$\mathbb{E}\|z^{k+1}-z^k\|_2^2\le\frac{18C}{\sigma_{\min}^A}+\frac{3\sigma_{\max}^2(G)}{\sigma_{\min}^A\eta^2}\mathbb{E}\|x^{k+1}-x^k\|_2^2+\Big(\frac{9L_F^2}{\sigma_{\min}^A}+\frac{3\sigma_{\max}^2(G)}{\sigma_{\min}^A\eta^2}\Big)\|x^k-x^{k-1}\|_2^2 \tag{10}$$
###### Proof.

By the optimality condition of the $x$-update step in Algorithm 1, we have:

$$v^k+\frac{G}{\eta}(x^{k+1}-x^k)-A^Tz^k+\rho A^T\Big(Ax^{k+1}+\sum_{j=1}^m B_jy_j^{k+1}-c\Big)=0 \tag{11}$$

By the update rule for the dual variable, we have:

$$A^Tz^{k+1}=v^k+\frac{G}{\eta}(x^{k+1}-x^k) \tag{12}$$

It follows that:

$$z^{k+1}=(A^T)^+\Big(v^k+\frac{G}{\eta}(x^{k+1}-x^k)\Big) \tag{13}$$

where $(A^T)^+$ is the pseudoinverse of $A^T$.

Taking expectation conditioned on the sampled batches:

$$\begin{aligned}\mathbb{E}\|z^{k+1}-z^k\|_2^2&=\mathbb{E}\Big\|(A^T)^+\Big(v^k+\frac{G}{\eta}(x^{k+1}-x^k)-v^{k-1}-\frac{G}{\eta}(x^k-x^{k-1})\Big)\Big\|_2^2\\&\le\frac{1}{\sigma_{\min}^A}\mathbb{E}\Big\|v^k+\frac{G}{\eta}(x^{k+1}-x^k)-v^{k-1}-\frac{G}{\eta}(x^k-x^{k-1})\Big\|_2^2\\&\le\frac{1}{\sigma_{\min}^A}\Big[3\mathbb{E}\|v^k-v^{k-1}\|_2^2+\frac{3\sigma_{\max}^2(G)}{\eta^2}\mathbb{E}\|x^{k+1}-x^k\|_2^2+\frac{3\sigma_{\max}^2(G)}{\eta^2}\|x^k-x^{k-1}\|_2^2\Big]\end{aligned} \tag{14}$$
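The first inequality above rests on $\|(A^T)^+u\|_2^2\le\frac{1}{\sigma_{\min}^A}\|u\|_2^2$ for a full-row-rank $A$. A quick numerical sanity check of this bound (with an illustrative random $A$):

```python
# Numerical sanity check of ||(A^T)^+ u||^2 <= ||u||^2 / sigma_min^A,
# where sigma_min^A is the smallest eigenvalue of A A^T (A full row rank).
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(3, 6))                      # full row rank with probability 1
u = rng.normal(size=6)

pinv_AT = np.linalg.pinv(A.T)                    # (A^T)^+
sigma_min_A = np.linalg.eigvalsh(A @ A.T).min()  # smallest eigenvalue of A A^T

lhs = np.linalg.norm(pinv_AT @ u) ** 2
rhs = np.linalg.norm(u) ** 2 / sigma_min_A
print(lhs, "<=", rhs)
```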

Now we bound $\mathbb{E}\|v^k-v^{k-1}\|_2^2$ from above:

$$\begin{aligned}\mathbb{E}\|v^k-v^{k-1}\|_2^2&=\mathbb{E}\|v^k-\nabla F(x^k)+\nabla F(x^k)-\nabla F(x^{k-1})+\nabla F(x^{k-1})-v^{k-1}\|_2^2\\&\le 3\mathbb{E}\|v^k-\nabla F(x^k)\|_2^2+3\|\nabla F(x^k)-\nabla F(x^{k-1})\|_2^2+3\mathbb{E}\|v^{k-1}-\nabla F(x^{k-1})\|_2^2\\&\le 6C+3L_F^2\|x^{k-1}-x^k\|_2^2\end{aligned} \tag{15}$$

where the last inequality follows from (9) and the $L_F$-smoothness of $F$.
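Both (14) and (15) use the elementary splitting inequality $\|a+b+c\|^2\le3(\|a\|^2+\|b\|^2+\|c\|^2)$. A quick numerical check of this inequality (illustrative only):

```python
# Numerical check of ||a + b + c||^2 <= 3(||a||^2 + ||b||^2 + ||c||^2),
# the splitting inequality used repeatedly in the proof.
import numpy as np

rng = np.random.default_rng(4)
ok = True
for _ in range(1000):
    a, b, c = rng.normal(size=(3, 7))
    lhs = np.linalg.norm(a + b + c) ** 2
    rhs = 3 * (np.linalg.norm(a) ** 2 + np.linalg.norm(b) ** 2 + np.linalg.norm(c) ** 2)
    ok = ok and (lhs <= rhs + 1e-9)
print(ok)
```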

Combining the above, we obtain the bound on the dual variable update:

$$\mathbb{E}\|z^{k+1}-z^k\|_2^2\le\frac{18C}{\sigma_{\min}^A}+\frac{3\sigma_{\max}^2(G)}{\sigma_{\min}^A\eta^2}\mathbb{E}\|x^{k+1}-x^k\|_2^2+\Big(\frac{9L_F^2}{\sigma_{\min}^A}+\frac{3\sigma_{\max}^2(G)}{\sigma_{\min}^A\eta^2}\Big)\|x^k-x^{k-1}\|_2^2 \tag{16}$$

###### Proof (decrease bound of the augmented Lagrangian).

By the optimality condition of the $y_j$-update (step 9) in Algorithm 1, we have:

$$\begin{aligned}0={}&(y_j^k-y_j^{k+1})^T\Big(\partial g_j(y_j^{k+1})-B_j^Tz^k+\rho B_j^T\Big(Ax^k+\sum_{i=1}^{j}B_iy_i^{k+1}+\sum_{i=j+1}^{m}B_iy_i^k-c\Big)+H_j(y_j^{k+1}-y_j^k)\Big)\\\le{}&g_j(y_j^k)-g_j(y_j^{k+1})-(z^k)^T(B_jy_j^k-B_jy_j^{k+1})\\&+\rho(B_jy_j^k-B_jy_j^{k+1})^T\Big(Ax^k+\sum_{i=1}^{j}B_iy_i^{k+1}+\sum_{i=j+1}^{m}B_iy_i^k-c\Big)-\|y_j^{k+1}-y_j^k\|_{H_j}^2\\={}&g_j(y_j^k)-g_j(y_j^{k+1})-(z^k)^T\Big(Ax^k+\sum_{i=1}^{j-1}B_iy_i^{k+1}+\sum_{i=j}^{m}B_iy_i^k-c\Big)+(z^k)^T\Big(Ax^k+\sum_{i=1}^{j}B_iy_i^{k+1}+\sum_{i=j+1}^{m}B_iy_i^k-c\Big)\\&+\frac{\rho}{2}\Big\|Ax^k+\sum_{i=1}^{j-1}B_iy_i^{k+1}+\sum_{i=j}^{m}B_iy_i^k-c\Big\|_2^2-\frac{\rho}{2}\Big\|Ax^k+\sum_{i=1}^{j}B_iy_i^{k+1}+\sum_{i=j+1}^{m}B_iy_i^k-c\Big\|_2^2\\&-\frac{\rho}{2}\|B_jy_j^k-B_jy_j^{k+1}\|_2^2-\|y_j^{k+1}-y_j^k\|_{H_j}^2\\={}&L_\rho(x^k,y_{[j-1]}^{k+1},y_{[j:m]}^k,z^k)-L_\rho(x^k,y_{[j]}^{k+1},y_{[j+1:m]}^k,z^k)-\frac{\rho}{2}\|B_jy_j^k-B_jy_j^{k+1}\|_2^2-\|y_j^{k+1}-y_j^k\|_{H_j}^2\\\le{}&L_\rho(x^k,y_{[j-1]}^{k+1},y_{[j:m]}^k,z^k)-L_\rho(x^k,y_{[j]}^{k+1},y_{[j+1:m]}^k,z^k)-\sigma_{\min}(H_j)\|y_j^k-y_j^{k+1}\|_2^2\end{aligned} \tag{17}$$

This gives the decrease bound for updating the $y_j$ component:

$$L_\rho(x^k,y_{[j]}^{k+1},y_{[j+1:m]}^k,z^k)-L_\rho(x^k,y_{[j-1]}^{k+1},y_{[j:m]}^k,z^k)\le-\sigma_{\min}(H_j)\|y_j^k-y_j^{k+1}\|_2^2 \tag{18}$$

Since $F$ is $L_F$-smooth, we have:

$$F(x^{k+1})\le F(x^k)+\langle\nabla F(x^k),x^{k+1}-x^k\rangle+\frac{L_F}{2}\|x^{k+1}-x^k\|_2^2$$

Now, using the optimality condition of the $x$-update in the algorithm, we have:

$$0=(x^k-x^{k+1})^T\Big(v^k+\frac{G}{\eta}(x^{k+1}-x^k)-A^Tz^k+\rho A^T\Big(Ax^{k+1}+\sum_{j=1}^m B_jy_j^{k+1}-c\Big)\Big) \tag{19}$$

Combining the two relations above, we have:

$$\begin{aligned}0\le{}&F(x^k)-F(x^{k+1})+\nabla F(x^k)^T(x^{k+1}-x^k)+\frac{L_F}{2}\|x^{k+1}-x^k\|_2^2\\&+(x^k-x^{k+1})^T\Big(v^k+\frac{G}{\eta}(x^{k+1}-x^k)-A^Tz^k+\rho A^T\Big(Ax^{k+1}+\sum_{j=1}^mB_jy_j^{k+1}-c\Big)\Big)\\={}&F(x^k)-F(x^{k+1})+\frac{L_F}{2}\|x^{k+1}-x^k\|_2^2-\frac{1}{\eta}\|x^k-x^{k+1}\|_G^2+(x^k-x^{k+1})^T(v^k-\nabla F(x^k))\\&-(z^k)^T(Ax^k-Ax^{k+1})+\rho(Ax^k-Ax^{k+1})^T\Big(Ax^{k+1}+\sum_{j=1}^mB_jy_j^{k+1}-c\Big)\\={}&F(x^k)-F(x^{k+1})+\frac{L_F}{2}\|x^{k+1}-x^k\|_2^2-\frac{1}{\eta}\|x^k-x^{k+1}\|_G^2+(x^k-x^{k+1})^T(v^k-\nabla F(x^k))\\&-(z^k)^T\Big(Ax^k+\sum_{j=1}^mB_jy_j^{k+1}-c\Big)+(z^k)^T\Big(Ax^{k+1}+\sum_{j=1}^mB_jy_j^{k+1}-c\Big)\\&+\frac{\rho}{2}\Big\|Ax^k+\sum_{j=1}^mB_jy_j^{k+1}-c\Big\|_2^2-\frac{\rho}{2}\Big\|Ax^{k+1}+\sum_{j=1}^mB_jy_j^{k+1}-c\Big\|_2^2-\frac{\rho}{2}\|Ax^k-Ax^{k+1}\|_2^2\\={}&L_\rho(x^k,y_{[m]}^{k+1},z^k)-L_\rho(x^{k+1},y_{[m]}^{k+1},z^k)\\&+\frac{L_F}{2}\|x^{k+1}-x^k\|_2^2-\frac{1}{\eta}\|x^k-x^{k+1}\|_G^2+(x^k-x^{k+1})^T(v^k-\nabla F(x^k))-\frac{\rho}{2}\|Ax^k-Ax^{k+1}\|_2^2\\\le{}&L_\rho(x^k,y_{[m]}^{k+1},z^k)-L_\rho(x^{k+1},y_{[m]}^{k+1},z^k)-\Big(\frac{\sigma_{\min}(G)}{\eta}+\frac{\rho\sigma_{\min}^A}{2}-\frac{L_F}{2}\Big)\|x^{k+1}-x^k\|_2^2+\langle x^k-x^{k+1},v^k-\nabla F(x^k)\rangle\\\le{}&L_\rho(x^k,y_{[m]}^{k+1},z^k)-L_\rho(x^{k+1},y_{[m]}^{k+1},z^k)-\Big(\frac{\sigma_{\min}(G)}{\eta}+\frac{\rho\sigma_{\min}^A}{2}-L_F\Big)\|x^{k+1}-x^k\|_2^2+\frac{1}{2L_F}\|v^k-\nabla F(x^k)\|_2^2\end{aligned} \tag{20}$$

Thus, rearranging, taking expectation over the sampled batches, and applying (9), we have:

$$\mathbb{E}\big[L_\rho(x^{k+1},y_{[m]}^{k+1},z^k)\big]-L_\rho(x^k,y_{[m]}^{k+1},z^k)\le-\Big(\frac{\sigma_{\min}(G)}{\eta}+\frac{\rho\sigma_{\min}^A}{2}-L_F\Big)\mathbb{E}\|x^{k+1}-x^k\|_2^2+\frac{C}{2L_F} \tag{21}$$

Now, using the update of the dual variable $z$ in the algorithm together with (16), we have:

$$\begin{aligned}\mathbb{E}\big[L_\rho(x^{k+1},y_{[m]}^{k+1},z^{k+1})-L_\rho(x^{k+1},y_{[m]}^{k+1},z^k)\big]&=\frac{1}{\rho}\mathbb{E}\|z^{k+1}-z^k\|_2^2\\&\le\frac{18C}{\rho\sigma_{\min}^A}+\frac{3\sigma_{\max}^2(G)}{\rho\sigma_{\min}^A\eta^2}\mathbb{E}\|x^{k+1}-x^k\|_2^2+\Big(\frac{9L_F^2}{\rho\sigma_{\min}^A}+\frac{3\sigma_{\max}^2(G)}{\rho\sigma_{\min}^A\eta^2}\Big)\|x^k-x^{k-1}\|_2^2\end{aligned} \tag{22}$$

Now, combining (18), (21) and (22), we have:

$$\begin{aligned}&L_\rho(x^{k+1},y_{[m]}^{k+1},z^{k+1})-L_\rho(x^k,y_{[m]}^k,z^k)\\\le{}&-\sum_{j=1}^m\sigma_{\min}(H_j)\|y_j^k-y_j^{k+1}\|_2^2-\Big(\frac{\sigma_{\min}(G)}{\eta}+\frac{\rho\sigma_{\min}^A}{2}-L_F\Big)\mathbb{E}\|x^{k+1}-x^k\|_2^2+\frac{C}{2L_F}\\&+\frac{18C}{\rho\sigma_{\min}^A}+\frac{3\sigma_{\max}^2(G)}{\rho\sigma_{\min}^A\eta^2}\mathbb{E}\|x^{k+1}-x^k\|_2^2+\Big(\frac{9L_F^2}{\rho\sigma_{\min}^A}+\frac{3\sigma_{\max}^2(G)}{\rho\sigma_{\min}^A\eta^2}\Big)\|x^k-x^{k-1}\|_2^2\end{aligned} \tag{23}$$