Hybrid Variance-Reduced SGD Algorithms For Minimax Problems with Nonconvex-Linear Function

Abstract

We develop a novel and single-loop variance-reduced algorithm to solve a class of stochastic nonconvex-concave minimax problems involving a nonconvex-linear objective function, which has various applications in different fields such as machine learning and robust optimization. This problem class poses several computational challenges due to its nonsmoothness, nonconvexity, nonlinearity, and the non-separability of the objective functions. Our approach relies on a new combination of recent ideas, including smoothing and hybrid biased variance-reduced techniques. Our algorithm and its variants can achieve an $\mathcal{O}(T^{-2/3})$-convergence rate and the best-known oracle complexity under standard assumptions, where $T$ is the iteration counter. They have several computational advantages compared to existing methods, such as being simple to implement and requiring less parameter tuning. They can also work with either single-sample or mini-batch derivative estimators, and with constant or diminishing step-sizes. We demonstrate the benefits of our algorithms over existing methods through two numerical examples, including a nonsmooth and nonconvex-non-strongly concave minimax model.

1 Introduction

We study the following stochastic minimax problem with a nonconvex-linear objective function, which covers various practical problems in different fields, see, e.g., Ben-Tal2009 (); Facchinei2003 (); goodfellow2014generative ():

$\min_{x\in\mathbb{R}^p}\max_{y\in\mathbb{R}^n}\Big\{ \mathcal{L}(x,y) := \phi(x) + \langle KF(x), y\rangle - \psi(y) \Big\},$   (1)

where $F(x) := \mathbb{E}_{\xi}\big[\mathbf{F}(x,\xi)\big]$ is the expectation of a stochastic vector function $\mathbf{F}(\cdot,\xi) : \mathbb{R}^p\to\mathbb{R}^q$ defined on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, $K \in \mathbb{R}^{n\times q}$ is a given matrix, $\langle\cdot,\cdot\rangle$ is an inner product, and $\phi : \mathbb{R}^p\to\mathbb{R}\cup\{+\infty\}$ and $\psi : \mathbb{R}^n\to\mathbb{R}\cup\{+\infty\}$ are proper, closed, and convex functions Bauschke2011 (). Problem (1) is a special case of the nonconvex-concave minimax problem, where $\mathcal{L}$ is nonconvex in $x$ and linear in $y$.

Due to the linearity of $\mathcal{L}(x,\cdot)$ w.r.t. $y$, (1) can be reformulated into a general stochastic compositional nonconvex problem of the form:

$\min_{x\in\mathbb{R}^p}\Big\{ \Psi(x) := \varphi\big(F(x)\big) + \phi(x) \Big\},$   (2)

where $\varphi : \mathbb{R}^q\to\mathbb{R}\cup\{+\infty\}$ is a convex, but possibly nonsmooth, function defined as

$\varphi(u) := \max_{y\in\mathbb{R}^n}\big\{ \langle Ku, y\rangle - \psi(y) \big\} = \psi^{*}(Ku),$   (3)

with $\psi^{*}$ being the Fenchel conjugate of $\psi$ Bauschke2011 (), and we define $\Psi^{\star} := \inf_{x\in\mathbb{R}^p}\Psi(x)$. Note that problem (2) is completely different from existing models such as drusvyatskiy2019efficiency (); duchi2018stochastic (), since here the expectation is inside the outer function $\varphi$, i.e., $\varphi\big(\mathbb{E}_{\xi}[\mathbf{F}(x,\xi)]\big)$ rather than $\mathbb{E}_{\xi}\big[\varphi(\mathbf{F}(x,\xi))\big]$. We refer to this setting as a “non-separable” model.
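
To make the reformulation concrete, consider a simple instance (our illustration, not an example from the paper): take $\psi := \delta_{\Delta_n}$, the indicator of the standard simplex $\Delta_n := \{y \in \mathbb{R}^n_{+} : \sum_{i=1}^n y_i = 1\}$. Then (3) yields $\varphi(u) = \max_{y\in\Delta_n}\langle Ku, y\rangle = \max_{1\le i\le n}(Ku)_i$, so (2) becomes $\min_{x}\big\{\max_{1\le i\le n}\big(K\,\mathbb{E}_{\xi}[\mathbf{F}(x,\xi)]\big)_i + \phi(x)\big\}$: a pointwise maximum of expectation functions, with the expectation sitting inside the nonsmooth max.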

Challenges: Developing numerical methods for solving (1) or (2) faces several challenges. First, (2) is often nonconvex since $F$ is not affine. Many recent papers consider special cases of (2) where the composition $\varphi(F(\cdot))$ is convex by imposing restrictive conditions, which are unfortunately not realistic in applications. Second, the max-form (3) is often nonsmooth if $\psi$ is not strongly convex, which prevents the use of gradient-based methods. Third, since the expectation is inside $\varphi$, it is very challenging to form an unbiased estimate for [sub]gradients of $\Psi$, making classical stochastic gradient-based methods inapplicable. Finally, prox-linear operator-based methods as in drusvyatskiy2019efficiency (); duchi2018stochastic (); tran2020stochastic (); zhang2020stochastic () require large mini-batch evaluations of both the function value $F$ and its Jacobian $F'$, see tran2020stochastic (); zhang2019multi (); zhang2020stochastic (), instead of single samples or small mini-batches, making them less flexible and more expensive than gradient-based methods.

Related work: Problem (1) has recently attracted considerable attention due to key applications, e.g., in game theory, robust optimization, distributionally robust optimization, and generative adversarial nets (GANs) Ben-Tal2009 (); Facchinei2003 (); goodfellow2014generative (); rahimian2019distributionally (). Various first-order methods have been developed to solve (1) during the past decades, both for convex-concave models, e.g., Bauschke2011 (); Korpelevic1976 (); Nemirovskii2004 (); tseng2008accelerated (), and for nonconvex-concave settings lin2018solving (); lin2019gradient (); loizou2020stochastic (); ostrovskii2020efficient (); thekumparampil2019efficient (). Some recent works consider a nonconvex-nonconcave formulation, e.g., nouiehed2019solving (); yang2020global (). However, they still rely on additional assumptions to guarantee that the maximization problem in (3) can be globally solved. One well-known assumption is the Polyak-Łojasiewicz (PL) condition, which is rather strong and often used to guarantee linear convergence rates. A majority of these works focus on deterministic models, while some methods have been extended to stochastic settings, e.g., lin2018solving (); yang2020global (). Although (1) is a special case of the more general models in lin2018solving (); lin2019gradient (); yang2020global (), it covers almost all examples in lin2018solving (); yang2020global (). Compared to these works, we only consider a special class of minimax problems where $\mathcal{L}$ is linear in $y$. However, our algorithm is rather simple with a single loop, and our oracle complexity is significantly improved over the ones in lin2018solving (); yang2020global ().

In a very recent work luo2020stochastic (), which is concurrent to our paper, the authors develop a double-loop algorithm, called SREDA, to handle a more general case than (1) where $\mathcal{L}$ is strongly concave in $y$. Their method exploits the SARAH estimator introduced in nguyen2017sarah () and can achieve the same oracle complexity as ours in Theorem 3.1 below. Compared to our work, although the problem setting in luo2020stochastic () is more general than (1), it does not cover the case where $\psi$ is non-strongly convex. This case is important for handling stochastic constrained optimization problems, where $\psi$ is nonsmooth and convex, but not necessarily strongly convex (see, e.g., (32) below as an example). Moreover, the SREDA algorithm in luo2020stochastic () requires double loops with large mini-batch sizes in both function values and derivatives, and uses small learning rates to achieve the desired oracle complexity.

It is interesting that the minimax problem (1) can be reformulated into a nonconvex compositional optimization problem of the form (2). The formulation (2) has been broadly studied in the literature under both deterministic and stochastic settings, see, e.g., drusvyatskiy2019efficiency (); duchi2018stochastic (); Lewis2008 (); Nesterov2007g (); Tran-Dinh2011 (); wang2017stochastic (). If $\varphi$ is the identity (with $q = 1$), then (2) reduces to the standard stochastic optimization model studied, e.g., in ghadimi2016accelerated (); Pham2019 (). In the deterministic setting, one common method to solve (2) is the prox-linear-type method, also known as a Gauss-Newton method Lewis2008 (); Nesterov2007g (). This method has been studied in several papers, including drusvyatskiy2019efficiency (); duchi2018stochastic (); Lewis2008 (); Nesterov2007g (); Tran-Dinh2011 (). However, the prox-linear operator often does not have a closed-form expression, and its evaluation may require solving a general nonsmooth strongly convex subproblem.

In the stochastic setting of (2), wang2017stochastic (); wang2017accelerating () proposed stochastic compositional gradient methods to solve more general forms than (2), but they required a set of stronger assumptions than Assumptions 2.1-2.2 below, including the smoothness of $\varphi$. Recent related works include lian2017finite (); liu2017variance (); xu2019katyusha (); yang2019multilevel (); yu2017fast (), which also rely on similar ideas. For instance, lin2018solving () proposed a double-loop subgradient-based method with $\mathcal{O}(\varepsilon^{-6})$ oracle complexity. Another subgradient-based method was recently proposed in yang2020global () based on a two-side PL condition. Stochastic methods exploiting prox-linear operators have also been recently proposed in tran2020stochastic (); zhang2020stochastic (), which are essentially extensions of existing deterministic methods to (2). Together with algorithms and convergence guarantees, stochastic oracle complexity bounds have also been estimated. For instance, wang2017stochastic () obtained an $\mathcal{O}(\varepsilon^{-8})$ oracle complexity for (2), which was improved to $\mathcal{O}(\varepsilon^{-4.5})$ in wang2017accelerating (). Recent works zhang2019multi (); zhang2019stochastic () further improved the complexity to $\mathcal{O}(\varepsilon^{-3})$. These methods require the smoothness of both $\varphi$ and $F$, use large batch sizes, and need a double-loop scheme. In contrast, our method has a single loop, can work with either single samples or mini-batches, and allows both constant and diminishing step-sizes. For nonsmooth $\varphi$, under the same assumptions as tran2020stochastic (); zhang2020stochastic (), our methods achieve the same $\mathcal{O}(\varepsilon^{-3})$ Jacobian and $\mathcal{O}(\varepsilon^{-5})$ function evaluation complexity as in those papers. However, our method is gradient-based and only uses the proximal operators of $\phi$ and $\psi$ instead of a complex prox-linear operator as in tran2020stochastic (); zhang2020stochastic (). Note that even if $\phi$ and $\psi$ have closed-form proximal operators, the prox-linear operator still does not have a closed-form solution, and requires solving a composite, possibly nonsmooth, strongly convex subproblem involving a linear operator, see, e.g., tran2020stochastic (). Moreover, our method can work with both single samples and mini-batches for the Jacobian, compared to the large batch sizes required in tran2020stochastic (); zhang2020stochastic ().

Our contribution: Our main contribution in this paper can be summarized as follows:

  • (a) We develop a new single-loop hybrid variance-reduced SGD algorithm to handle (1) under Assumptions 2.1 and 2.2 below. Under the strong convexity of $\psi$, our algorithm achieves an $\mathcal{O}\big(1/(bT)^{2/3}\big)$ convergence rate to approximate a KKT (Karush-Kuhn-Tucker) point of (1), where $b$ is the batch size and $T$ is the iteration counter. We also estimate an $\mathcal{O}(\varepsilon^{-3})$ oracle complexity to obtain an $\varepsilon$-KKT point, matching the best-known one, e.g., in luo2020stochastic (); zhang2019multi (); zhang2019stochastic (). Our complexity bound holds for a wide range of batch sizes $b$, as opposed to a specific choice as in luo2020stochastic (); zhang2019multi (); zhang2019stochastic (). Moreover, our algorithm has only a single loop, compared to luo2020stochastic (); zhang2019multi ().

  • (b) When $\psi$ is non-strongly convex, we combine our approach with a smoothing technique to develop a gradient-based variant that can achieve the best-known complexity of $\mathcal{O}(\varepsilon^{-3})$ Jacobian and $\mathcal{O}(\varepsilon^{-5})$ function evaluations for finding an $\varepsilon$-KKT point of (1). Moreover, our algorithm does not require prox-linear operators or large batches for the Jacobian as in tran2020stochastic (); zhang2020stochastic ().

  • (c) We also propose a simple restarting technique to accelerate the practical performance in both cases (a) and (b), without sacrificing convergence guarantees (see Supp. Doc. C).

Our methods exploit the recent biased hybrid estimators introduced in Tran-Dinh2019a (), as opposed to the SARAH ones in tran2020stochastic (); zhang2019multi (); zhang2020stochastic (). This allows us to simplify our algorithm to a single loop without large batches at each iteration, compared to zhang2019multi (). As indicated in arjevani2019lower (), our oracle complexity is optimal under the considered assumptions. If $\psi$ is non-strongly convex (i.e., $\varphi$ in (2) can be nonsmooth), then our algorithm is fundamentally different from the ones in tran2020stochastic (); zhang2020stochastic (), as it does not use a prox-linear operator. Note that evaluating a prox-linear operator requires solving a general strongly convex but possibly nonsmooth subproblem. In addition, those methods only work with large batch sizes for both $F$ and its Jacobian $F'$.

Content: Section 2 states our assumptions and recalls some mathematical tools. Section 3 develops a new algorithm and analyzes its convergence. Section 4 provides two numerical examples to compare our methods with existing ones. All technical details and proofs are deferred to the Supplementary Document (Supp. Doc.).

2 Basic assumptions, KKT points and smoothing technique

Notation: We work with finite-dimensional spaces $\mathbb{R}^p$ and $\mathbb{R}^n$ equipped with the standard inner product $\langle\cdot,\cdot\rangle$ and Euclidean norm $\|\cdot\|$. For a function $f$, $\mathrm{dom}(f) := \{x : f(x) < +\infty\}$ denotes its domain. If $f$ is convex, then $\mathrm{prox}_f(x) := \mathrm{arg}\min_z\big\{f(z) + \tfrac{1}{2}\|z - x\|^2\big\}$ denotes its proximal operator, $\partial f$ denotes its subdifferential, and $\nabla f$ is its [sub]gradient, see, e.g., Bauschke2011 (). $f$ is $\mu_f$-strongly convex with strong convexity parameter $\mu_f > 0$ if $f(\cdot) - \frac{\mu_f}{2}\|\cdot\|^2$ remains convex. For a smooth vector function $F$, $F'$ denotes its Jacobian. We use $\mathrm{dist}(x, \mathcal{X})$ to denote the Euclidean distance from $x$ to a convex set $\mathcal{X}$.

2.1 Model assumptions

Let $F(x) := \mathbb{E}_{\xi}\big[\mathbf{F}(x,\xi)\big]$ denote the expectation function of $\mathbf{F}(\cdot,\xi)$ and $\mathrm{dom}(\Psi)$ denote the domain of $\Psi$. Throughout this paper, we always assume that $\mathrm{dom}(\Psi) \neq \emptyset$ in (2) and $\phi$ is proper, closed, and convex, without recalling these conditions in the sequel. Our goal is to develop stochastic gradient-based algorithms to solve (1) relying on the following assumptions:

Assumption 2.1.

The function $\mathbf{F}$ in problem (1) or (2) satisfies the following assumptions:

  • Smoothness: $\mathbf{F}$ is $L_F$-average smooth with $L_F \in (0, +\infty)$, i.e., for all $x, \hat{x} \in \mathrm{dom}(\phi)$:

    $\mathbb{E}_{\xi}\big[\|\mathbf{F}'(x,\xi) - \mathbf{F}'(\hat{x},\xi)\|^2\big] \le L_F^2\,\|x - \hat{x}\|^2.$   (4)
  • Bounded variance: There exist two constants $\sigma_F, \sigma_J \in (0, +\infty)$ such that $\mathbb{E}_{\xi}\big[\|\mathbf{F}(x,\xi) - F(x)\|^2\big] \le \sigma_F^2$ and $\mathbb{E}_{\xi}\big[\|\mathbf{F}'(x,\xi) - F'(x)\|^2\big] \le \sigma_J^2$.
  • Lipschitz continuity: $\mathbf{F}$ is $M_F$-average Lipschitz continuous with $M_F \in (0, +\infty)$, i.e., for all $x, \hat{x} \in \mathrm{dom}(\phi)$:

    $\mathbb{E}_{\xi}\big[\|\mathbf{F}(x,\xi) - \mathbf{F}(\hat{x},\xi)\|^2\big] \le M_F^2\,\|x - \hat{x}\|^2.$   (5)

Note that Assumption 2.1 is standard in stochastic nonconvex optimization, see tran2020stochastic (); zhang2019multi (); zhang2019stochastic (); zhang2020stochastic (). If $\mathbb{E}_{\xi}\big[\|\mathbf{F}'(x,\xi)\|^2\big]$ is bounded on $\mathrm{dom}(\phi)$, then $\mathbf{F}$ is $M_F$-average Lipschitz continuous, and the last condition automatically holds.

For $\psi$, we only require the following assumption, which is mild and holds in many applications.

Assumption 2.2.

The function $\psi$ in (1) is proper, closed, and convex. Moreover, $\mathrm{dom}(\psi)$ is bounded by $M_{\psi}$, i.e.: $\sup\big\{\|y\| : y \in \mathrm{dom}(\psi)\big\} \le M_{\psi}$.

An important special case of $\psi$ is the indicator function of a convex and bounded set. Hitherto, we do not require $\varphi$ and $\phi$ in (2) to be smooth or strongly convex. They can be nonsmooth, so that (2) also covers constrained problems. Note that the boundedness of $\mathrm{dom}(\psi)$ is equivalent to the Lipschitz continuity of $\psi^{*}$ (Lemma A.1). Simple examples of such $\varphi$ include norms and gauge functions.
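
For instance (our illustration): if $\psi := \delta_{\mathcal{B}(0, M_{\psi})}$ is the indicator of the Euclidean ball of radius $M_{\psi}$, then $\psi^{*}(z) = M_{\psi}\|z\|$, which is Lipschitz continuous with modulus exactly $M_{\psi}$, illustrating the equivalence between the boundedness of $\mathrm{dom}(\psi)$ and the Lipschitz continuity of $\psi^{*}$.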

2.2 KKT points and approximate KKT points

Since (1) is nonconvex-concave, a pair $(x^{\star}, y^{\star})$ is said to be a KKT point of (1) if

$0 \in F'(x^{\star})^{\top}K^{\top}y^{\star} + \partial\phi(x^{\star}) \quad\text{and}\quad 0 \in \partial\psi(y^{\star}) - KF(x^{\star}).$   (6)

From (6), we have $y^{\star} \in \partial\psi^{*}\big(KF(x^{\star})\big)$. Substituting this into the first expression, we get

$0 \in F'(x^{\star})^{\top}\partial\varphi\big(F(x^{\star})\big) + \partial\phi(x^{\star}).$   (7)

Here, we have used $\partial\varphi(u) = K^{\top}\partial\psi^{*}(Ku)$, where $\varphi$ is given by (3). This inclusion shows that $x^{\star}$ is a stationary point of (2). In the convex-concave case, under mild assumptions, a KKT point is also a saddle-point of (1). In particular, if (2) is convex, then $x^{\star}$ is also a global optimum of (2).

However, in practice, we can only find an approximation of a KKT point for (1).

Definition 2.1.

Given a tolerance $\varepsilon > 0$, a pair $(\tilde{x}, \tilde{y})$ is called an $\varepsilon$-KKT point of (1) if

$\mathbb{E}\big[\mathrm{dist}\big(0,\, F'(\tilde{x})^{\top}K^{\top}\tilde{y} + \partial\phi(\tilde{x})\big)\big] \le \varepsilon \quad\text{and}\quad \mathbb{E}\big[\mathrm{dist}\big(KF(\tilde{x}),\, \partial\psi(\tilde{y})\big)\big] \le \varepsilon.$   (8)

Here, the expectation is taken over all the randomness from both model (1) and the algorithm. Clearly, if $\varepsilon = 0$, then $(\tilde{x}, \tilde{y})$ is a KKT point of (1) as characterized by (6).

2.3 Smoothing techniques

Under Assumption 2.2, $\varphi$ defined by (3) can be nonsmooth. Hence, we can smooth $\varphi$ as follows:

$\varphi_{\gamma}(u) := \max_{y\in\mathbb{R}^n}\big\{ \langle Ku, y\rangle - \psi(y) - \gamma\, b(y) \big\},$   (9)

where $b$ is a continuously differentiable and 1-strongly convex prox-function such that $\min_{y} b(y) = 0$, and $\gamma > 0$ is a smoothness parameter. For example, we can choose $b(y) := \frac{1}{2}\|y - \dot{y}\|^2$ for a fixed $\dot{y}$, or the entropy $b(y) := \log n + \sum_{j=1}^n y_j\log y_j$ defined on the standard simplex Nesterov2005c (). Under Assumption 2.2, $\varphi_{\gamma}$ possesses some useful properties, as stated in Lemma A.1 (Supp. Doc. A.1).

Let $y^{*}_{\gamma}(u)$ be the optimal solution of the maximization problem in (9), which always exists and is unique. In particular, if $b(y) := \frac{1}{2}\|y - \dot{y}\|^2$, then

$y^{*}_{\gamma}(u) = \mathrm{prox}_{\psi/\gamma}\Big(\dot{y} + \tfrac{1}{\gamma}Ku\Big).$   (10)

Hence, when $\psi$ is proximally tractable (i.e., its proximal operator can be computed in closed form or by a low-order polynomial-time algorithm), computing $y^{*}_{\gamma}(u)$ reduces to evaluating a single proximal operator of $\psi$, as opposed to solving a complex subproblem as in prox-linear methods tran2020stochastic (); zhang2020stochastic ().
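
As a minimal sketch of this dual step (our code with illustrative names, assuming $\psi$ is the indicator of the Euclidean ball of radius $M_{\psi}$, so that $\mathrm{prox}_{\psi/\gamma}$ is the projection onto that ball):

```python
import numpy as np

def y_star(u, K, gamma, y_dot, M_psi):
    """Smoothed dual solution (10) with b(y) = 0.5*||y - y_dot||^2 and
    psi = indicator of {y : ||y|| <= M_psi}, so prox_{psi/gamma} = projection."""
    z = y_dot + (K @ u) / gamma          # argument of the prox in (10)
    nz = np.linalg.norm(z)
    return z if nz <= M_psi else (M_psi / nz) * z
```

For any other proximally tractable $\psi$ (e.g., a simplex or $\ell_1$-ball indicator), only the projection in the last two lines changes.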

Given $\varphi_{\gamma}$ defined by (9), we consider the following functions:

$\Phi_{\gamma}(x) := \varphi_{\gamma}\big(F(x)\big) \quad\text{and}\quad \Psi_{\gamma}(x) := \Phi_{\gamma}(x) + \phi(x).$   (11)

In this case, under Assumptions 2.1 and 2.2, $\Phi_{\gamma}$ is continuously differentiable, and

$\nabla\Phi_{\gamma}(x) = F'(x)^{\top}K^{\top}y^{*}_{\gamma}\big(F(x)\big).$   (12)

Smoothness: Moreover, $\Phi_{\gamma}$ is $L_{\gamma}$-smooth with $L_{\gamma} := \|K\|M_{\psi}L_F + \frac{\|K\|^2 M_F^2}{\gamma}$ (see zhang2019stochastic ()), i.e.:

$\|\nabla\Phi_{\gamma}(x) - \nabla\Phi_{\gamma}(\hat{x})\| \le L_{\gamma}\,\|x - \hat{x}\|, \quad \forall x, \hat{x} \in \mathrm{dom}(\phi),$   (13)

where $M_{\psi}$ and the related constants are given in Lemma A.1.

Gradient mapping: For $\eta > 0$, let us recall the following gradient mapping of $\Psi_{\gamma}$ given in (11):

$\mathcal{G}_{\eta}(x) := \tfrac{1}{\eta}\Big(x - \mathrm{prox}_{\eta\phi}\big(x - \eta\nabla\Phi_{\gamma}(x)\big)\Big).$   (14)

This mapping will be used to characterize approximate KKT points of (1) in Definition 2.1.
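
For illustration, here is a hedged sketch of (14) when $\phi := \lambda\|\cdot\|_1$, so that $\mathrm{prox}_{\eta\phi}$ is soft-thresholding (the names are ours):

```python
import numpy as np

def soft_threshold(x, tau):
    """Closed-form prox of tau*||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def gradient_mapping(x, grad_Phi, eta, lam):
    """Gradient mapping (14) with phi = lam*||.||_1; grad_Phi stands for
    nabla Phi_gamma(x) from (12), or for its estimator v_t from (17)."""
    x_plus = soft_threshold(x - eta * grad_Phi, eta * lam)
    return (x - x_plus) / eta
```

A small $\|\mathcal{G}_{\eta}(x)\|$ certifies approximate stationarity of (2), which Lemma 3.1 below converts into an approximate KKT point of (1).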

3 The proposed algorithm and its convergence analysis

First, we introduce a stochastic estimator for $\nabla\Phi_{\gamma}$. Then, we develop our main algorithm and analyze its convergence and oracle complexity. Finally, we show how to construct an $\varepsilon$-KKT point of (1).

3.1 Stochastic estimators and the algorithm

Since $F$ is the expectation of a stochastic function $\mathbf{F}(\cdot,\xi)$, we exploit the hybrid stochastic estimators for $F$ and its Jacobian $F'$ introduced in Tran-Dinh2019a (). More precisely, given a sequence $\{x_t\}$ generated by a stochastic algorithm, our hybrid stochastic estimators $\widetilde{F}_t$ of $F(x_t)$ and $\widetilde{J}_t$ of $F'(x_t)$ are defined as follows:

$\begin{cases} \widetilde{F}_t := \beta_{t-1}\Big[\widetilde{F}_{t-1} + \frac{1}{b_t}\sum_{\xi_i\in\mathcal{B}_t}\big(\mathbf{F}(x_t,\xi_i) - \mathbf{F}(x_{t-1},\xi_i)\big)\Big] + \frac{1-\beta_{t-1}}{\hat{b}_t}\sum_{\hat{\xi}_j\in\widehat{\mathcal{B}}_t}\mathbf{F}(x_t,\hat{\xi}_j), \\ \widetilde{J}_t := \beta_{t-1}\Big[\widetilde{J}_{t-1} + \frac{1}{c_t}\sum_{\zeta_i\in\mathcal{C}_t}\big(\mathbf{F}'(x_t,\zeta_i) - \mathbf{F}'(x_{t-1},\zeta_i)\big)\Big] + \frac{1-\beta_{t-1}}{\hat{c}_t}\sum_{\hat{\zeta}_j\in\widehat{\mathcal{C}}_t}\mathbf{F}'(x_t,\hat{\zeta}_j), \end{cases}$   (15)

where $\beta_{t-1} \in [0, 1]$ are given weights, and the initial estimators $\widetilde{F}_0$ and $\widetilde{J}_0$ are defined as

$\widetilde{F}_0 := \frac{1}{b_0}\sum_{\xi_i\in\mathcal{B}_0}\mathbf{F}(x_0,\xi_i) \quad\text{and}\quad \widetilde{J}_0 := \frac{1}{c_0}\sum_{\zeta_i\in\mathcal{C}_0}\mathbf{F}'(x_0,\zeta_i).$   (16)

Here, $\mathcal{B}_0$, $\mathcal{C}_0$, $\mathcal{B}_t$, $\widehat{\mathcal{B}}_t$, $\mathcal{C}_t$, and $\widehat{\mathcal{C}}_t$ are mini-batches of sizes $b_0$, $c_0$, $b_t$, $\hat{b}_t$, $c_t$, and $\hat{c}_t$, respectively. We allow $\widehat{\mathcal{B}}_t$ to be correlated with $\mathcal{B}_t$, and $\widehat{\mathcal{C}}_t$ to be correlated with $\mathcal{C}_t$. We also do not require any independence between these mini-batches. When $\widehat{\mathcal{B}}_t \equiv \mathcal{B}_t$ and $\widehat{\mathcal{C}}_t \equiv \mathcal{C}_t$, our estimators reduce to the STORM estimators studied in Cutkosky2019 () as a special case. Clearly, with the choices $\widehat{\mathcal{B}}_t \subseteq \mathcal{B}_t$ and $\widehat{\mathcal{C}}_t \subseteq \mathcal{C}_t$, we can save $\hat{b}_t$ function evaluations and $\hat{c}_t$ Jacobian evaluations at each iteration.
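
The recursion (15) is straightforward to implement. Below is a minimal sketch (our naming), written for the function estimator; the Jacobian estimator $\widetilde{J}_t$ is updated by the same routine applied to $\mathbf{F}'$ samples:

```python
import numpy as np

def hybrid_update(est_prev, beta, new_vals, old_vals, sgd_vals):
    """One hybrid variance-reduced update (15). new_vals[i] and old_vals[i]
    are F(x_t, xi_i) and F(x_{t-1}, xi_i) on the same mini-batch B_t, while
    sgd_vals[j] is F(x_t, zeta_j) on the (possibly identical) batch B_hat_t."""
    sarah_term = np.mean(new_vals - old_vals, axis=0)   # SARAH-type correction
    sgd_term = np.mean(sgd_vals, axis=0)                # plain SGD estimator
    return beta * (est_prev + sarah_term) + (1.0 - beta) * sgd_term
```

Passing the same samples for `new_vals` and `sgd_vals` recovers the STORM-type special case mentioned above.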

For $\widetilde{F}_t$ and $\widetilde{J}_t$ defined by (15), we introduce a stochastic estimator for the gradient of $\Phi_{\gamma}$ in (11) at $x_t$ as follows:

$v_t := \widetilde{J}_t^{\top}K^{\top}\tilde{y}_t, \quad\text{where}\quad \tilde{y}_t := y^{*}_{\gamma_t}\big(\widetilde{F}_t\big).$   (17)

To evaluate $v_t$, we need to compute $\tilde{y}_t = y^{*}_{\gamma_t}(\widetilde{F}_t)$, which requires just one proximal operation of $\psi$ if we use (10). Moreover, due to (16) and (17), evaluating $v_t$ does not require forming the full matrix products explicitly, but only matrix-vector products of the form $\mathbf{F}'(\cdot,\zeta)^{\top}w$ with $w := K^{\top}\tilde{y}_t$, which is often cheaper than evaluating and storing the full Jacobian.
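
Combining the pieces, here is a hedged sketch of (17), reusing `y_star` from the snippet above:

```python
def grad_estimator(J_tilde, F_tilde, K, gamma, y_dot, M_psi):
    """Gradient estimator (17) of nabla Phi_gamma at x_t: v_t = J_tilde^T K^T y_t
    with y_t = y*_gamma(F_tilde) from (10). Only the action of J_tilde^T on a
    vector is needed, so vector-Jacobian products suffice in practice; the
    dense matrices here are for clarity only."""
    y_t = y_star(F_tilde, K, gamma, y_dot, M_psi)   # one prox evaluation
    return J_tilde.T @ (K.T @ y_t)
```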

Using the new estimator $v_t$ of $\nabla\Phi_{\gamma}(x_t)$ in (17), we propose Algorithm 1 to solve (1).

1:Inputs: An arbitrary initial point $x_0 \in \mathrm{dom}(\phi)$.
2:     Input parameters $\beta_t \in (0,1)$, $\gamma_t > 0$, $\eta_t > 0$, and mini-batch sizes (specified in Subsection 3.2).
3:Initialization: Generate $\widetilde{F}_0$ and $\widetilde{J}_0$ as in (16) with mini-batch sizes $b_0$ and $c_0$, respectively.
4:     Solve (9) to obtain $\tilde{y}_0 := y^{*}_{\gamma_0}(\widetilde{F}_0)$. Then, evaluate $v_0 := \widetilde{J}_0^{\top}K^{\top}\tilde{y}_0$.
5:     Update $x_1 := \mathrm{prox}_{\eta_0\phi}(x_0 - \eta_0 v_0)$ and $y_1 := \tilde{y}_0$.
6:For $t := 1, \cdots, T$ do
7:     Construct $\widetilde{F}_t$ and $\widetilde{J}_t$ as in (15) and $v_t := \widetilde{J}_t^{\top}K^{\top}\tilde{y}_t$, where $\tilde{y}_t := y^{*}_{\gamma_t}(\widetilde{F}_t)$ solves (9).
8:     Update $x_{t+1} := \mathrm{prox}_{\eta_t\phi}(x_t - \eta_t v_t)$ and $y_{t+1} := \tilde{y}_t$.
9:     Update $\beta_t$, $\gamma_t$, and $\eta_t$ if necessary.
10:EndFor
11:Output: Choose $\bar{x}_T$ randomly from $\{x_0, \cdots, x_T\}$ with $\mathrm{Prob}(\bar{x}_T = x_t) = p_t$.
Algorithm 1 (Smoothing Hybrid Variance-Reduced SGD Algorithm for solving (1))

Algorithm 1 is designed by adopting the idea in Tran-Dinh2019a (): it can start from two initial mini-batches $\mathcal{B}_0$ and $\mathcal{C}_0$ to generate a good approximation of the search direction before entering the main loop. If diminishing step-sizes are used, it does not require such initial batches. However, it has major differences compared to Tran-Dinh2019a (): the dual step $\tilde{y}_t$, the estimator $v_t$, and the dynamic parameter updates. Note that, as explained in (10), since the dual step $\tilde{y}_t$ can be computed by a single proximal operation of $\psi$, Algorithm 1 has a single loop, making it easy to implement in practice compared to methods based on SVRG johnson2013accelerating () and SARAH nguyen2017sarah () such as luo2020stochastic (); zhang2019multi ().
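
To illustrate the overall scheme, the following self-contained toy run of Algorithm 1 uses a synthetic instance of our own construction ($\mathbf{F}(x,\xi) := (A+\xi)x - b$ with Gaussian perturbations, $K := I$, $\phi := \lambda\|\cdot\|_1$, and $\psi$ a ball indicator). It reuses `y_star`, `soft_threshold`, `hybrid_update`, and `grad_estimator` from the snippets above; all constants are arbitrary placeholders, not the tuned choices of Subsection 3.2:

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 20, 15                                  # x in R^p, F(x) in R^q
A, bvec = rng.normal(size=(q, p)), rng.normal(size=q)
K = np.eye(q)                                  # y in R^q for simplicity
lam, M_psi, gamma, eta, beta = 0.01, 1.0, 0.1, 0.05, 0.9
y_dot = np.zeros(q)

def F_s(x, xi):                                # stochastic F(x, xi)
    return (A + xi) @ x - bvec

def J_s(x, xi):                                # its Jacobian F'(x, xi)
    return A + xi

def draw(m):                                   # mini-batch of perturbations
    return 0.1 * rng.normal(size=(m, q, p))

# Initialization (16) with mini-batch sizes b0 = c0 = 16
x = rng.normal(size=p)
xi0 = draw(16)
F_t = np.mean([F_s(x, n) for n in xi0], axis=0)
J_t = np.mean([J_s(x, n) for n in xi0], axis=0)

for t in range(200):
    v = grad_estimator(J_t, F_t, K, gamma, y_dot, M_psi)  # dual step + (17)
    x_new = soft_threshold(x - eta * v, eta * lam)        # prox-gradient step
    xi = draw(4)                                          # B_t = B_hat_t, b_t = 4
    F_t = hybrid_update(F_t, beta, np.stack([F_s(x_new, n) for n in xi]),
                        np.stack([F_s(x, n) for n in xi]),
                        np.stack([F_s(x_new, n) for n in xi]))
    J_t = hybrid_update(J_t, beta, np.stack([J_s(x_new, n) for n in xi]),
                        np.stack([J_s(x, n) for n in xi]),
                        np.stack([J_s(x_new, n) for n in xi]))
    x = x_new
```

Each iteration costs one proximal step on $\psi$ (inside `grad_estimator`), one on $\phi$, and one small mini-batch of function and Jacobian samples, which is exactly the single-loop structure discussed above.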

3.2 Convergence analysis of Algorithm 1

Let $\mathcal{F}_t$ be the $\sigma$-field generated by Algorithm 1 up to the $t$-th iteration, which is defined as follows:

$\mathcal{F}_t := \sigma\big(x_0, x_1, \cdots, x_t\big).$   (18)

If $\psi$ is $\mu_{\psi}$-strongly convex, then, without loss of generality, we can assume $\mu_{\psi} := 1$. Otherwise, we can rescale it. Moreover, for the sake of our presentation, for a given $\gamma \ge 0$, we introduce:

$L_{\gamma} := \|K\|M_{\psi}L_F + \frac{\|K\|^2 M_F^2}{\mu_{\psi} + \gamma}, \qquad \sigma_{\gamma}^2 := \kappa\,\|K\|^2 M_{\psi}^2\,\sigma_J^2 + \frac{\hat{\kappa}\,\|K\|^2 M_F^2\,\sigma_F^2}{(\mu_{\psi} + \gamma)^2}, \qquad \Delta_0 := \Psi_{\gamma}(x_0) - \Psi^{\star},$   (19)

where $L_F$, $M_F$, $\sigma_F$, and $\sigma_J$ are given in Assumption 2.1 and $M_{\psi}$ is in Assumption 2.2 (this $L_{\gamma}$ extends the one in (13) by the strong convexity parameter $\mu_{\psi} \ge 0$ of $\psi$). Here, $\kappa := 1$ if the mini-batch $\widehat{\mathcal{C}}_t$ is independent of $\mathcal{C}_t$, and $\kappa := 2$ otherwise. Similarly, $\hat{\kappa} := 1$ if $\widehat{\mathcal{B}}_t$ is independent of $\mathcal{B}_t$, and $\hat{\kappa} := 2$ otherwise.

The strongly concave case

Theorem 3.1, whose proof is in Supp. Doc. B.3, analyzes the convergence rate and oracle complexity of Algorithm 1 for the smooth case of $\varphi$ in (2) (i.e., $\psi$ is strongly convex).

Theorem 3.1 (Constant step-size).

Suppose that Assumptions 2.1 and 2.2 hold, $\psi$ is $\mu_{\psi}$-strongly convex with $\mu_{\psi} := 1$, and $L_0$, $\sigma_0^2$, and $\Delta_0$ are defined in (19) with $\gamma := 0$. Given a mini-batch size $b \ge 1$, let $b_t = \hat{b}_t = c_t = \hat{c}_t := b$ and $b_0 = c_0 := \lceil b^{1/3}T^{2/3}\rceil$. Let $\{(x_t, \tilde{y}_t)\}$ be generated by Algorithm 1 using the constant weight and step-size

$\beta_t := \beta = 1 - \frac{1}{(bT)^{2/3}} \quad\text{and}\quad \eta_t := \eta = \frac{c_{\eta}}{L_0} \text{ for some } c_{\eta} \in (0, 1],$   (20)

provided that $(bT)^{2/3} \ge 2$. Let $\bar{x}_T$ be chosen uniformly at random from $\{x_t\}_{t=0}^{T}$, i.e., $p_t := \frac{1}{T+1}$. Then, we have

$\mathbb{E}\big[\|\mathcal{G}_{\eta}(\bar{x}_T)\|^2\big] = \frac{1}{T+1}\sum_{t=0}^{T}\mathbb{E}\big[\|\mathcal{G}_{\eta}(x_t)\|^2\big] \le \frac{C\big(L_0\Delta_0 + \sigma_0^2\big)}{(bT)^{2/3}}$   (21)

for a constant $C > 0$ independent of $b$ and $T$. For a given tolerance $\varepsilon > 0$, the total number of iterations to obtain $\mathbb{E}\big[\|\mathcal{G}_{\eta}(\bar{x}_T)\|^2\big] \le \varepsilon^2$ is at most $T := \mathcal{O}\big(b^{-1}\varepsilon^{-3}\big)$. The total numbers of function evaluations and Jacobian evaluations are both at most $\mathcal{O}\big(\varepsilon^{-3}\big)$.

Theorem 3.2 states the convergence of Algorithm 1 using diminishing step-sizes (see Supp. Doc. B.4 for the proof).

Theorem 3.2 (Diminishing step-size).

Suppose that Assumptions 2.1 and 2.2 hold, and $\psi$ is $\mu_{\psi}$-strongly convex with $\mu_{\psi} := 1$, i.e., $\varphi$ in (2) is smooth. Let $\{(x_t, \tilde{y}_t)\}$ be generated by Algorithm 1 using the mini-batch sizes as in Theorem 3.1, and the increasing weight and diminishing step-sizes

$\beta_t := 1 - \frac{1}{(t+2)^{2/3}} \quad\text{and}\quad \eta_t := \frac{c_{\eta}}{L_0\,(t+2)^{1/3}} \text{ for some } c_{\eta} \in (0, 1].$   (22)

Then, for all $T \ge 1$, and $\bar{x}_T$ chosen from $\{x_t\}_{t=0}^{T}$ with $\mathrm{Prob}(\bar{x}_T = x_t) := \frac{\eta_t}{\sum_{i=0}^{T}\eta_i}$, we have

$\mathbb{E}\big[\|\mathcal{G}_{\eta_T}(\bar{x}_T)\|^2\big] \le \mathcal{O}\Big(\frac{L_0\Delta_0 + \sigma_0^2\ln T}{T^{2/3}}\Big).$   (23)

If we set $b := 1$, then our convergence rate is $\mathcal{O}\big(T^{-2/3}\ln T\big)$, a $\ln T$ factor slower than (21). However, it does not require a large initial mini-batch as in Theorem 3.1. In Theorems 3.1 and 3.2, we do not need to smooth $\varphi$. Hence, $\gamma$ is absent in Algorithm 1, i.e., $\gamma_t := 0$ for all $t \ge 0$.
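
In code, a hedged sketch of the schedule (22) (the exponents follow the theorem; the constants are placeholders rather than the calibrated ones from Supp. Doc. B.4):

```python
def schedule_smooth(t, L0, c_eta=1.0):
    """Increasing weight beta_t and diminishing step-size eta_t as in (22)."""
    beta_t = 1.0 - 1.0 / (t + 2) ** (2.0 / 3.0)
    eta_t = c_eta / (L0 * (t + 2) ** (1.0 / 3.0))
    return beta_t, eta_t
```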

The non-strongly concave case

Now, we consider the case $\mu_{\psi} = 0$, i.e., $\psi$ is non-strongly convex (or equivalently, (1) is non-strongly concave in $y$), leading to the nonsmoothness of $\varphi$ in (2). Theorem 3.3 states the convergence of Algorithm 1 in this case; its proof is in Supp. Doc. B.5.

Theorem 3.3 (Constant step-size).

Assume that Assumptions 2.1 and 2.2 hold, $\psi$ in (1) is non-strongly convex (i.e., $\varphi$ is nonsmooth), and $L_{\gamma}$, $\sigma_{\gamma}^2$, and $\Delta_0$ are defined in (19) with $\mu_{\psi} = 0$. Let $b$ and $\hat{b}$ be two positive integers, $c_t = \hat{c}_t := b$, $b_t = \hat{b}_t := \hat{b}$, and let $\{(x_t, \tilde{y}_t)\}$ be generated by Algorithm 1 after $T$ iterations using:

$\gamma_t := \gamma = \frac{c_{\gamma}}{T^{1/3}}, \qquad \beta_t := \beta = 1 - \frac{1}{T^{2/3}}, \qquad \eta_t := \eta = \frac{c_{\eta}}{L_{\gamma}},$   (24)

for some $c_{\gamma} > 0$ and $c_{\eta} \in (0, 1]$. Then, with $D_b$ defined in Lemma A.1 and $\bar{x}_T$ chosen uniformly at random from $\{x_t\}_{t=0}^{T}$, the following bound holds:

$\mathbb{E}\big[\|\mathcal{G}_{\eta}(\bar{x}_T)\|^2\big] \le \frac{C_1 L_{\gamma}\big(\Delta_0 + \gamma D_b\big)}{T} + \frac{C_2\,\sigma_{\gamma}^2}{\hat{b}\,T^{2/3}}$   (25)

for constants $C_1, C_2 > 0$ independent of $b$, $\hat{b}$, and $T$. The total number of iterations to achieve $\mathbb{E}\big[\|\mathcal{G}_{\eta}(\bar{x}_T)\|^2\big] \le \varepsilon^2$ is at most $T := \mathcal{O}\big(\varepsilon^{-3}\big)$. The total numbers of function evaluations and Jacobian evaluations are respectively at most $\mathcal{O}\big(\varepsilon^{-5}\big)$ and $\mathcal{O}\big(\varepsilon^{-3}\big)$.

If we choose $\hat{b} := \lceil c\,T^{2/3}\rceil$ for some $c > 0$, then $\hat{b} = \mathcal{O}\big(\varepsilon^{-2}\big)$, and the above oracle complexity bounds hold.
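
In code, a hedged reading of these parameter choices (constants illustrative only):

```python
import math

def params_theorem_3_3(eps, c_gamma=1.0, c_b=1.0):
    """Budget suggested by Theorem 3.3: T = O(eps^-3) iterations, smoothing
    gamma = O(eps) (since gamma ~ T^{-1/3}), a constant function mini-batch
    of size O(eps^-2), and an O(1) Jacobian mini-batch."""
    T = math.ceil(eps ** -3)
    gamma = c_gamma * eps
    b_hat = math.ceil(c_b * eps ** -2)   # function mini-batch size
    b = 1                                # Jacobian mini-batch size
    return T, gamma, b_hat, b
```

These choices reproduce the stated totals: $T\,\hat{b} = \mathcal{O}(\varepsilon^{-5})$ function evaluations and $T\,b = \mathcal{O}(\varepsilon^{-3})$ Jacobian evaluations.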

Alternatively, we can also establish convergence and estimate the oracle complexity of Algorithm 1 with diminishing step-sizes, as in Theorem 3.4, whose proof is in Supp. Doc. B.6.

Theorem 3.4 (Diminishing step-size).

Suppose that Assumptions 2.1 and 2.2 hold, $\psi$ is non-strongly convex (i.e., $\varphi$ is possibly nonsmooth), and $L_{\gamma}$, $\sigma_{\gamma}^2$, and $\Delta_0$ are defined by (19) with $\mu_{\psi} = 0$. Given a mini-batch size $b \ge 1$, let $c_t = \hat{c}_t := b$, $b_t = \hat{b}_t := \lceil c_b\,b\,(t+2)^{2/3}\rceil$, and $b_0 = c_0 := \lceil b\,T^{2/3}\rceil$ for some $c_b > 0$. Let $\{(x_t, \tilde{y}_t)\}$ be generated by Algorithm 1 using the increasing weight and diminishing parameters

$\beta_t := 1 - \frac{1}{(t+2)^{2/3}}, \qquad \gamma_t := \frac{c_{\gamma}}{(t+2)^{1/3}}, \qquad \eta_t := \frac{c_{\eta}}{L_{\gamma_t}},$   (26)

for some $c_{\gamma} > 0$ and $c_{\eta} \in (0, 1]$. For $\bar{x}_T$ chosen from $\{x_t\}_{t=0}^{T}$ as $\mathrm{Prob}(\bar{x}_T = x_t) := \frac{\eta_t}{\sum_{i=0}^{T}\eta_i}$, we have

$\mathbb{E}\big[\|\mathcal{G}_{\eta_T}(\bar{x}_T)\|^2\big] \le \mathcal{O}\Big(\frac{\ln T}{T^{2/3}}\Big).$   (27)

Note that since $\gamma_t = \mathcal{O}\big((t+2)^{-1/3}\big)$ is diminishing and $\hat{b}_t = \mathcal{O}\big(\gamma_t^{-2}\big)$, we have $\hat{b}_t = \mathcal{O}\big((t+2)^{2/3}\big)$, which shows that the mini-batch sizes for the function estimator are chosen in an increasing manner (not fixed at a large size for all $t$), which can save computational cost for $F$ in early iterations. The batch sizes $b$ and $\hat{b}_t$ in Theorems 3.3 and 3.4 must be chosen to guarantee the overall $\mathcal{O}\big(T^{-2/3}\big)$ rate.
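
A hedged sketch of the diminishing schedule (26) together with the increasing function mini-batches (placeholder constants, ours):

```python
import math

def schedule_nonsmooth(t, b=1, c_gamma=1.0, c_eta=1.0, L_F=1.0, M_F=1.0,
                       M_psi=1.0, normK=1.0):
    """Parameters of Theorem 3.4: diminishing gamma_t and eta_t as in (26),
    with the function mini-batch b_hat_t growing like gamma_t^{-2}."""
    gamma_t = c_gamma / (t + 2) ** (1.0 / 3.0)
    L_gamma = normK * M_psi * L_F + (normK * M_F) ** 2 / gamma_t   # cf. (13)
    beta_t = 1.0 - 1.0 / (t + 2) ** (2.0 / 3.0)
    eta_t = c_eta / L_gamma
    b_hat_t = math.ceil(b * (t + 2) ** (2.0 / 3.0))
    return beta_t, gamma_t, eta_t, b_hat_t
```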

3.3 Constructing approximate KKT point for (1) from Algorithm 1

Existing works such as luo2020stochastic (); zhang2019multi (); zhang2020stochastic () do not show how to construct an $\varepsilon$-KKT point of (1) or an $\varepsilon$-stationary point of (2) from an output $\bar{x}_T$ with $\mathbb{E}\big[\|\mathcal{G}_{\eta}(\bar{x}_T)\|^2\big] \le \varepsilon^2$. Lemma 3.1, whose proof is in Supp. Doc. A.3, shows one way to construct an $\varepsilon$-KKT point of (1) in the sense of Definition 2.1 from the output of Theorems 3.1, 3.2, 3.3, and 3.4.

Lemma 3.1.

Let $\bar{x}_T$ be computed by Algorithm 1 up to an accuracy $\mathbb{E}\big[\|\mathcal{G}_{\eta}(\bar{x}_T)\|^2\big] \le \varepsilon^2$ after $T$ iterations. Assume that we can approximate $F(\bar{x}_T)$ and $F'(\bar{x}_T)$ by $\bar{F}_T$ and $\bar{J}_T$, respectively, such that

$\mathbb{E}\big[\|\bar{F}_T - F(\bar{x}_T)\|^2\big] \le \varepsilon^2 \quad\text{and}\quad \mathbb{E}\big[\|\bar{J}_T - F'(\bar{x}_T)\|^2\big] \le \varepsilon^2.$   (28)

Let us denote $\bar{v}_T := \bar{J}_T^{\top}K^{\top}\bar{y}_T$ and compute $(\bar{x}_T^{+}, \bar{y}_T)$ as

$\bar{y}_T := y^{*}_{\gamma}\big(\bar{F}_T\big) \quad\text{and}\quad \bar{x}_T^{+} := \mathrm{prox}_{\eta\phi}\big(\bar{x}_T - \eta\,\bar{v}_T\big).$   (29)

Suppose that $\gamma = \mathcal{O}(\varepsilon)$ and $\eta \le \frac{c}{L_{\gamma}}$ for a constant $c > 0$. Then $(\bar{x}_T^{+}, \bar{y}_T)$ is an $\mathcal{O}(\varepsilon)$-KKT point of (1) in the sense of Definition 2.1.
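
A hedged sketch of this construction, reusing `y_star` and `soft_threshold` from the snippets above (again with $\psi$ a ball indicator and $\phi = \lambda\|\cdot\|_1$ for concreteness):

```python
def construct_kkt_pair(x_bar, F_bar, J_bar, K, gamma, eta, y_dot, M_psi, lam):
    """Construction (29): form the dual candidate y_bar from the estimate F_bar
    of F(x_bar) via the smoothed dual solution (10), then take one proximal
    gradient step to obtain x_bar_plus."""
    y_bar = y_star(F_bar, K, gamma, y_dot, M_psi)
    v_bar = J_bar.T @ (K.T @ y_bar)
    x_bar_plus = soft_threshold(x_bar - eta * v_bar, eta * lam)
    return x_bar_plus, y_bar
```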