Nonconvex Stochastic Nested Optimization via Stochastic ADMM

Zhongruo Wang
Abstract

We consider the stochastic nested composition optimization problem, where the objective is a composition of two expected-value functions. We propose a stochastic ADMM method to solve this complicated objective. In order to find an $\epsilon$-stationary point, at which the expected norm of the subgradient of the corresponding augmented Lagrangian is smaller than $\epsilon$, the total sample complexity of our method is  for the online case and  for the finite-sum case. The computational complexity is consistent with the proximal version proposed in [Zhang and Xiao, 2019], but our algorithm can handle more general problems in which the proximal mapping of the penalty is not easy to compute.

1 Introduction

Consider the following optimization problem:

(1)

An interesting special case arises when the randomness follows a uniform distribution over a finite set:

(2)
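The displayed problems (1) and (2) did not survive extraction. For concreteness, a plausible reading, consistent with the abstract (a composition of two expected-value functions plus a penalty handled through a linear constraint, in the style of [Zhang and Xiao, 2019]) and stated here only as a sketch with illustrative symbols $f_\nu$, $g_\omega$, $r$, $A$, $B$, $c$, is

\[
\min_{x,\,y}\; \mathbb{E}_{\nu}\Big[f_{\nu}\big(\mathbb{E}_{\omega}[g_{\omega}(x)]\big)\Big] + r(y)
\qquad \text{s.t.}\quad Ax + By = c,
\]

with the finite-sum special case, obtained when the distributions are uniform over $\{1,\dots,n\}$ and $\{1,\dots,m\}$,

\[
\min_{x,\,y}\; \frac{1}{n}\sum_{i=1}^{n} f_{i}\Big(\frac{1}{m}\sum_{j=1}^{m} g_{j}(x)\Big) + r(y)
\qquad \text{s.t.}\quad Ax + By = c.
\]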

2 Motivation and Previous Works

When the penalty is not as simple as an $\ell_1$ penalty, for example the graph-guided lasso and fused lasso penalties, simple proximal algorithms cannot be applied. In such cases, performing operator splitting and using ADMM is well suited to this kind of problem; ADMM for the general convex and strongly convex cases has been studied in [Yu and Huang, 2017]. Their formulation places a rather special assumption on the penalty, which is not general enough for most ADMM problems. Using ADMM to solve the nonconvex nested composite objective considered here has not been well studied, although various variance-reduced stochastic proximal methods have been analyzed in both the convex and nonconvex settings. Proximal versions of these algorithms have also been studied for formulations with multiple levels of composite functions [Zhang and Xiao, 2019], [Lin et al., 2018], where different iteration complexities and stochastic oracles were analyzed.
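As a concrete illustration of a penalty whose proximal mapping is hard to compute, consider the graph-guided fused lasso: the regularizer has the form $\lambda\|Kx\|_1$, where $K$ encodes the edges of a given graph ($K$ and $\lambda$ are illustrative symbols, not the paper's notation). The proximal mapping of $\|K\cdot\|_1$ has no closed form for general $K$, but introducing the splitting variable $y = Kx$ yields exactly the linearly constrained form that ADMM handles:

\[
\min_{x}\; F(x) + \lambda\|Kx\|_1
\quad\Longleftrightarrow\quad
\min_{x,\,y}\; F(x) + \lambda\|y\|_1
\qquad \text{s.t.}\quad Kx - y = 0.
\]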

3 Contribution

In this work we present a stochastic variance-reduced ADMM algorithm to solve two-level and multi-level composite stochastic problems in both the finite-sum and online cases. We denote the sampling number by  and the augmented Lagrangian with penalty by . In order to achieve  for a given threshold , we show that for simple mini-batch estimation the iteration complexity is  and the total complexity is , which is too costly; when using a stochastic path-integrated estimator such as SARAH/SPIDER, we show that the total sampling complexity is  for the online case and  for the finite-sum case.

4 Assumptions and Notations

The following assumptions are made for the further analysis of the algorithms:

  1. .

  2. and has full column rank or full row rank.

  3. is -smooth

  4. and  are two smooth vector mappings, and each realization of the random mapping is -Lipschitz continuous and its Jacobian is -Lipschitz continuous.

  5. for all

  6. for all

  7. is a convex regularizer such as ,

  •  and  denote the smallest and largest eigenvalues of the matrix , and  denotes the smallest and largest eigenvalues of  for all .
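The right-hand sides of Assumptions 5 and 6 were lost in extraction. A plausible reading, consistent with the bounded-variance assumptions used in [Zhang and Xiao, 2019] (the symbols $\sigma_g$, $\sigma_{g'}$, $\sigma_f$ below are illustrative), is

\[
\mathbb{E}\big\|g_{\omega}(x) - g(x)\big\|^{2} \le \sigma_{g}^{2}, \qquad
\mathbb{E}\big\|\partial g_{\omega}(x) - \partial g(x)\big\|^{2} \le \sigma_{g'}^{2}, \qquad
\mathbb{E}\big\|\nabla f_{\nu}(u) - \nabla f(u)\big\|^{2} \le \sigma_{f}^{2},
\]

for all $x$ and $u$.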

Definition 4.1.

For any $\epsilon > 0$, the point is said to be an $\epsilon$-stationary point of the nonconvex problem (1) if it holds that:

(3)

where , denotes the subgradient of . If $\epsilon = 0$, the point is said to be a stationary point.

The above inequalities (3) are equivalent to , where:

(4)

and is the Lagrangian function of the objective function (1).
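The displays (3) and (4) did not survive extraction. For linearly constrained nonconvex problems solved by ADMM, an $\epsilon$-stationary point $(x^{*}, y^{*}, \lambda^{*})$ is typically characterized by conditions of the following shape (a sketch only; the exact quantities and constants of the original definition may differ):

\[
\mathbb{E}\big\|\nabla_{x}\mathcal{L}(x^{*},y^{*},\lambda^{*})\big\| \le \epsilon, \qquad
\mathbb{E}\Big[\operatorname{dist}\big(0,\ \partial_{y}\mathcal{L}(x^{*},y^{*},\lambda^{*})\big)\Big] \le \epsilon, \qquad
\mathbb{E}\big\|Ax^{*} + By^{*} - c\big\| \le \epsilon.
\]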

5 Main Result

From the perspective of any stochastic algorithm, the goal is to estimate the gradient as accurately as possible. The gradient of the composite objective follows from the chain rule:

(5)
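Equation (5) did not survive extraction. Writing the composite part as $F(x) = f(g(x))$ with $f(u) = \mathbb{E}_{\nu}[f_{\nu}(u)]$ and $g(x) = \mathbb{E}_{\omega}[g_{\omega}(x)]$ (illustrative notation), the chain rule gives the standard expression

\[
\nabla F(x) \;=\; \big[\partial g(x)\big]^{\top}\,\nabla f\big(g(x)\big),
\]

where $\partial g(x)$ is the Jacobian of $g$ at $x$. The estimators introduced below replace $g(x)$, $\partial g(x)$, and $\nabla f(\cdot)$ by sampled counterparts, so the resulting plug-in estimator $\big[\widehat{\partial g}\big]^{\top}\widehat{\nabla f}(\widehat{g})$ is biased, because the inner estimate $\widehat{g}$ enters $\nabla f$ nonlinearly.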

We now introduce abbreviations for the approximations:

Then the overall estimator for the gradient is . To solve the problem using stochastic ADMM, we first write down the augmented Lagrangian function of the problem:

(6)

Since only a stochastic estimate of the gradient is available for updating , we use the approximate Lagrangian over  with the estimated gradient :

(7)

In order to avoid computing the inverse of , we can set  with  to linearize the quadratic term . Also, in order to compute the proximal operator for each , we can set  with  for all  to linearize the term . The question that remains is how to find a suitable gradient estimator for the composite function.
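Since the displays (6) and (7) were lost, the following sketch records the standard augmented Lagrangian and its linearized approximation for the constrained formulation assumed above (the symbols $\rho$, $\lambda$, $G$, $\eta$ are illustrative):

\[
\mathcal{L}_{\rho}(x,y,\lambda) \;=\; F(x) + r(y) - \langle \lambda,\ Ax + By - c\rangle + \tfrac{\rho}{2}\,\|Ax + By - c\|^{2},
\]

and the approximate Lagrangian for the $x$-update replaces $F$ by a first-order model built from the estimated gradient $\widehat{\nabla F}(x^{k})$,

\[
\widehat{\mathcal{L}}_{\rho}(x,y,\lambda) \;=\; \big\langle \widehat{\nabla F}(x^{k}),\ x - x^{k}\big\rangle - \langle \lambda,\ Ax + By - c\rangle + \tfrac{\rho}{2}\,\|Ax + By - c\|^{2} + \tfrac{1}{2}\,\|x - x^{k}\|_{G}^{2}.
\]

Choosing $G = \eta I - \rho A^{\top}A \succ 0$ cancels the quadratic coupling in $x$ and avoids inverting $A^{\top}A$; an analogous choice in the $y$-subproblem reduces that step to the proximal operator of the penalty.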

Recall the $\epsilon$-stationary point of the solution defined in Definition 4.1.

In the following sections, we first consider mini-batch estimation of the gradient and show that ADMM still converges with this simple implementation under a suitable choice of parameters. After that, we consider using SARAH/SPIDER to estimate the nested gradient. By comparing the sampling complexities, we show that the SARAH/SPIDER-based algorithm is more efficient than the traditional mini-batch-based algorithm.

6 Simple Mini-Batch Estimator

When facing a stochastic composite objective, one simple and straightforward strategy is to estimate the composite gradient using mini-batches. We denote by  the mini-batch estimate of a function . Since we are computing a composite gradient, we use a mini-batch strategy for both the gradient and the function value at each level. This leads to the following algorithm.

1 Initialization: Initial Point , Batch size: ,
2 for  to  do
3       Randomly sample batch of with ;
4      
5       Randomly sample batch of with , and with
6      
7      
8       Calculated the nested gradient estimation:
9       for all
10      
11      
12 end for
Output:  chosen uniformly at random from 
Algorithm 1: Stochastic Nested ADMM with a simple mini-batch estimator
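Since most of the formulas inside Algorithm 1 were lost, the following Python sketch illustrates the two ingredients the algorithm combines: a nested mini-batch gradient estimator and one linearized ADMM round. The function names, the penalty $\nu\|y\|_1$, and the constraint $Kx - y = 0$ are our assumptions for illustration, not the paper's exact setting.

import numpy as np

def nested_minibatch_grad(x, sample_g, sample_jac_g, sample_grad_f, b1, b2):
    """Plug-in mini-batch estimator of grad F(x) for F(x) = f(g(x)).

    sample_g(x, b)      -> (b, p) array of samples of g_omega(x)
    sample_jac_g(x, b)  -> (b, p, d) array of Jacobian samples of g_omega at x
    sample_grad_f(u, b) -> (b, p) array of samples of grad f_nu(u)
    The estimator is biased because the inner estimate enters grad f nonlinearly.
    """
    g_hat = sample_g(x, b2).mean(axis=0)                 # inner value estimate, shape (p,)
    jac_hat = sample_jac_g(x, b2).mean(axis=0)           # inner Jacobian estimate, shape (p, d)
    grad_f_hat = sample_grad_f(g_hat, b1).mean(axis=0)   # outer gradient at g_hat, shape (p,)
    return jac_hat.T @ grad_f_hat                        # chain rule: [dg]^T grad f

def linearized_admm_round(x, y, lam, grad_hat, K, nu, rho, eta):
    """One round of linearized ADMM for min F(x) + nu*||y||_1  s.t.  Kx - y = 0."""
    # x-update: gradient step on the linearized augmented Lagrangian
    x = x - (grad_hat - K.T @ lam + rho * K.T @ (K @ x - y)) / eta
    # y-update: soft-thresholding, i.e. the proximal operator of (nu/rho)*||.||_1
    z = K @ x - lam / rho
    y = np.sign(z) * np.maximum(np.abs(z) - nu / rho, 0.0)
    # dual update (sign matches the -<lam, Kx - y> convention in the Lagrangian)
    lam = lam - rho * (K @ x - y)
    return x, y, lam

In Algorithm 1 such a round is repeated over the iterations, with fresh batches drawn at every step and the output chosen uniformly at random over the iterates.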

From the algorithm we can see that even though  is a biased estimate of the full gradient, we can still analyze the variance of the approximation and make it small. First, for each iteration, from [Zhang and Xiao, 2019] we know that:

Using the mini-batch estimator, the variance of each estimator satisfies:

Also, we can have:

(8)

Now, conditioning on the batches, the variance bound on the estimated gradient is:

(9)
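Display (9) was lost in extraction; the generic mini-batch scaling behind bounds of this type is worth recording. If a mini-batch of size $B$ averages i.i.d. samples $\xi_1,\dots,\xi_B$ with variance at most $\sigma^{2}$, then

\[
\mathbb{E}\Big\|\frac{1}{B}\sum_{i=1}^{B}\xi_{i} - \mathbb{E}[\xi_{1}]\Big\|^{2}
\;=\; \frac{1}{B}\,\mathbb{E}\big\|\xi_{1} - \mathbb{E}[\xi_{1}]\big\|^{2}
\;\le\; \frac{\sigma^{2}}{B},
\]

so each plug-in factor of the nested estimator has variance of order $1/B_1$ or $1/B_2$ (illustrative labels for the outer and inner batch sizes), and combining the factors through the Lipschitz constants of $f$ and $g$ yields a bound on $\mathbb{E}\|\widehat{\nabla F}(x) - \nabla F(x)\|^{2}$ of order $1/B_1 + 1/B_2$.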

Now, we are ready to analyze the convergence of our proposed ADMM based on the mini-batch estimator.

Lemma 6.1 (Bound on the dual variable).

Given that the sequence  is generated by Algorithm 1, we have the following bound on the dual-variable update:

(10)
Proof.

By the optimality condition of the corresponding update step of the algorithm, we have:

(11)

By the update rule for the dual variable, we have:

(12)

It follows that:

(13)

where is the pseudoinverse of .

Taking expectation conditioned on :

(14)

Now we derive an upper bound on :

(15)

where the last inequality follows from (9).

Finally, we obtain the bound on the dual-variable update:

(16)
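The displays in this proof were lost; the skeleton of the argument, as commonly used in nonconvex stochastic ADMM analyses, is roughly as follows (a sketch with illustrative symbols, not the paper's exact constants). The optimality condition of the $x$-update gives an identity of the form

\[
A^{\top}\lambda^{k+1} \;=\; \widehat{\nabla F}(x^{k}) + G\,(x^{k+1} - x^{k}),
\qquad\text{hence}\qquad
\lambda^{k+1} \;=\; (A^{\top})^{+}\Big[\widehat{\nabla F}(x^{k}) + G\,(x^{k+1} - x^{k})\Big],
\]

so that, using the smallest nonzero eigenvalue of $AA^{\top}$,

\[
\mathbb{E}\|\lambda^{k+1} - \lambda^{k}\|^{2}
\;\lesssim\; \frac{1}{\sigma_{\min}(AA^{\top})}\Big(
\mathbb{E}\big\|\widehat{\nabla F}(x^{k}) - \widehat{\nabla F}(x^{k-1})\big\|^{2}
+ \|G\|^{2}\,\mathbb{E}\|x^{k+1} - x^{k}\|^{2}
+ \|G\|^{2}\,\mathbb{E}\|x^{k} - x^{k-1}\|^{2}\Big),
\]

and the gradient-difference term is then controlled by smoothness together with the variance bound (9).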

Lemma 6.2 (Point convergence).
Proof.

By the optimality condition of step 9 in Algorithm 1, we have:

(17)

Now, the descent bound for updating the  component is:

(18)

Since  is -smooth, we have:
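The smoothness inequality invoked here is presumably the standard descent lemma: for an $L$-smooth function $F$ (symbols illustrative),

\[
F(x^{k+1}) \;\le\; F(x^{k}) + \big\langle \nabla F(x^{k}),\ x^{k+1} - x^{k}\big\rangle + \frac{L}{2}\,\|x^{k+1} - x^{k}\|^{2}.
\]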

Now, using the optimality condition for the  update in the algorithm, we have:

(19)

Combining the two equations above, we have:

(20)

Thus, rearranging and taking expectation over the batches  and , we have:

(21)

Now, using the update of  in the algorithm, we have:

(22)

Now, combining (18), (21), and (22), we have:

(23)