Hybrid Variance-Reduced SGD Algorithms for Minimax Problems with Nonconvex-Linear Functions
Abstract
We develop a novel single-loop variance-reduced algorithm to solve a class of stochastic nonconvex-convex minimax problems involving a nonconvex-linear objective function, which has various applications in different fields such as machine learning and robust optimization. This problem class poses several computational challenges due to the nonsmoothness, nonconvexity, nonlinearity, and nonseparability of its objective functions. Our approach relies on a new combination of recent ideas, including smoothing and hybrid biased variance-reduced techniques. Our algorithm and its variants can achieve an $\mathcal{O}(T^{-2/3})$ convergence rate and the best-known oracle complexity under standard assumptions, where $T$ is the iteration counter. They have several computational advantages compared to existing methods, such as simplicity of implementation and fewer parameter-tuning requirements. They can also work with either single-sample or minibatch derivative estimators, and with constant or diminishing stepsizes. We demonstrate the benefits of our algorithms over existing methods through two numerical examples, including a nonsmooth and nonconvex-nonstrongly concave minimax model.
1 Introduction
We study the following stochastic minimax problem with a nonconvex-linear objective function, which covers various practical problems in different fields; see, e.g., BenTal2009 (); Facchinei2003 (); goodfellow2014generative ():
(1) $\min_{x \in \mathbb{R}^p} \max_{y \in \mathbb{R}^q} \Big\{ \Phi(x, y) := f(x) + \langle K G(x), y \rangle - g(y) \Big\}, \quad \text{with } G(x) := \mathbb{E}_{\xi}\big[ G(x, \xi) \big],$
where $G(\cdot, \xi)$ is a stochastic vector function defined on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, $K$ is a given matrix, $\langle \cdot, \cdot \rangle$ is an inner product, and $f$ and $g$ are proper, closed, and convex functions Bauschke2011 (). Problem (1) is a special case of the nonconvex-concave minimax problem, where $\Phi$ is nonconvex in $x$ and linear in $y$.
Due to the linearity of $\Phi$ w.r.t. $y$, (1) can be reformulated into a general stochastic compositional nonconvex problem of the form:
(2) $\min_{x \in \mathbb{R}^p} \Big\{ \Psi(x) := f(x) + h(G(x)) \Big\},$
where $h$ is a convex, but possibly nonsmooth, function defined as
(3) $h(u) := \max_{y \in \mathbb{R}^q} \big\{ \langle K u, y \rangle - g(y) \big\} = g^{*}(K u),$
with $g^{*}$ being the Fenchel conjugate of $g$ Bauschke2011 (). Note that problem (2), in which the expectation is inside the outer function $h$, i.e., $\Psi(x) = f(x) + h\big(\mathbb{E}_{\xi}[G(x,\xi)]\big)$, is completely different from existing models such as drusvyatskiy2019efficiency (); duchi2018stochastic (). We refer to this setting as a “nonseparable” model.
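To make the max-form (3) concrete, here is a minimal numerical sketch (a hypothetical small instance; the helper name `h_maxform` is ours) for the special case where $g$ is the indicator of the probability simplex: a linear objective over the simplex attains its maximum at a vertex, so $h(u) = \max_j (Ku)_j$, a convex but nonsmooth function.

```python
# Special case of (3): g = indicator of the probability simplex, so
# h(u) = max_y { <Ku, y> - g(y) } = max over the simplex vertices e_j of (Ku)_j.
def h_maxform(u, K):
    # matrix-vector product Ku, then the maximum coordinate (vertex value)
    Ku = [sum(K[j][i] * u[i] for i in range(len(u))) for j in range(len(K))]
    return max(Ku)

u = [1.0, -2.0, 0.5]
K = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # K = identity
assert h_maxform(u, K) == max(u)  # for K = I, h(u) = max_i u_i
```

Note that the algorithm never needs to evaluate $h$ itself; only the smoothed maximizer $y^{*}_{\gamma}$ in (10) below is required.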
Challenges: Developing numerical methods for solving (1) or (2) faces several challenges. First, (2) is often nonconvex, i.e., $G$ is not affine. Many recent papers consider special cases of (2) in which the composition $h(G(\cdot))$ is convex by imposing restrictive conditions, which are unfortunately not realistic in applications. Second, the max-form in (3) is often nonsmooth if $g$ is not strongly convex. This prevents the use of gradient-based methods. Third, since the expectation is inside $h$, it is very challenging to form an unbiased estimate of the [sub]gradients of $\Psi$, making classical stochastic gradient-based methods inapplicable. Finally, prox-linear operator-based methods as in drusvyatskiy2019efficiency (); duchi2018stochastic (); tran2020stochastic (); zhang2020stochastic () require large minibatch evaluations of both the function value $G$ and its Jacobian $G'$, see tran2020stochastic (); zhang2019multi (); zhang2020stochastic (), instead of a single sample or a small minibatch, making them less flexible and more expensive than gradient-based methods.
Related work: Problem (1) has recently attracted considerable attention due to key applications, e.g., in game theory, robust optimization, distributionally robust optimization, and generative adversarial nets (GANs) BenTal2009 (); Facchinei2003 (); goodfellow2014generative (); rahimian2019distributionally (). Various first-order methods have been developed to solve (1) during the past decades, both for convex-concave models, e.g., Bauschke2011 (); Korpelevic1976 (); Nemirovskii2004 (); tseng2008accelerated (), and for nonconvex-concave settings lin2018solving (); lin2019gradient (); loizou2020stochastic (); ostrovskii2020efficient (); thekumparampil2019efficient (). Some recent works consider a nonconvex-nonconcave formulation, e.g., nouiehed2019solving (); yang2020global (). However, they still rely on additional assumptions to guarantee that the maximization problem in (3) can be globally solved. One well-known assumption is the Polyak-Łojasiewicz (PL) condition, which is rather strong and often used to guarantee linear convergence rates. A majority of these works focus on deterministic models, while some methods have been extended to stochastic settings, e.g., lin2018solving (); yang2020global (). Although (1) is a special case of the general models in lin2018solving (); lin2019gradient (); yang2020global (), it covers almost all the examples in lin2018solving (); yang2020global (). Compared to these works, we only consider a special class of minimax problems where the objective is linear in $y$. However, our algorithm is rather simple with a single loop, and our oracle complexity is significantly improved over the ones in lin2018solving (); yang2020global ().
In a very recent work luo2020stochastic (), which is concurrent with our paper, the authors develop a double-loop algorithm, called SREDA, to handle a more general case than (1) where $\Phi$ is strongly concave in $y$. Their method exploits the SARAH estimator introduced in nguyen2017sarah () and can achieve the same oracle complexity as ours in Theorem 3.1 below. Compared to our work, although the problem setting in luo2020stochastic () is more general than (1), it does not cover the nonstrongly concave case. This case is important for handling stochastic constrained optimization problems, where $g$ is nonsmooth and convex, but not necessarily strongly convex (see, e.g., (32) below as an example). Moreover, the SREDA algorithm in luo2020stochastic () requires double loops with large minibatch sizes for both function values and derivatives, and uses small learning rates to achieve the desired oracle complexity.
It is interesting that the minimax problem (1) can be reformulated into the nonconvex compositional optimization problem (2). The formulation (2) has been broadly studied in the literature under both deterministic and stochastic settings, see, e.g., drusvyatskiy2019efficiency (); duchi2018stochastic (); Lewis2008 (); Nesterov2007g (); TranDinh2011 (); wang2017stochastic (). If $h$ is the identity map and $G$ is scalar-valued, then (2) reduces to the standard stochastic optimization model studied, e.g., in ghadimi2016accelerated (); Pham2019 (). In the deterministic setting, one common method to solve (2) is the prox-linear-type method, which is also known as a Gauss-Newton method Lewis2008 (); Nesterov2007g (). This method has been studied in several papers, including drusvyatskiy2019efficiency (); duchi2018stochastic (); Lewis2008 (); Nesterov2007g (); TranDinh2011 (). However, the prox-linear operator often does not have a closed-form expression, and its evaluation may require solving a general nonsmooth strongly convex subproblem.
In the stochastic setting of (2), wang2017stochastic (); wang2017accelerating () proposed stochastic compositional gradient methods to solve more general forms than (2), but they required a set of stronger assumptions than Assumptions 2.1-2.2 below, including the smoothness of $h$. Recent related works include lian2017finite (); liu2017variance (); xu2019katyusha (); yang2019multilevel (); yu2017fast (), which also rely on similar ideas. For instance, lin2018solving () proposed a double-loop subgradient-based method with a higher oracle complexity than ours. Another subgradient-based method was recently proposed in yang2020global () based on a two-sided PL condition. Stochastic methods exploiting prox-linear operators have also been recently proposed in tran2020stochastic (); zhang2020stochastic (), which are essentially extensions of existing deterministic methods to (2). Together with these algorithms and their convergence guarantees, stochastic oracle complexity bounds have also been estimated. For instance, wang2017stochastic () obtained an oracle complexity bound for (2), which was then improved in wang2017accelerating (). Recent works zhang2019multi (); zhang2019stochastic () further improved the complexity to $\mathcal{O}(\varepsilon^{-3})$. These methods require the smoothness of both $h$ and $G$, use large batch sizes, and need a double-loop scheme. In contrast, our method has a single loop, can work with either a single sample or a minibatch, and allows both constant and diminishing stepsizes. For nonsmooth $h$, under the same assumptions as tran2020stochastic (); zhang2020stochastic (), our methods achieve the same Jacobian and function evaluation complexity as in those papers. However, our method is gradient-based, using only the proximal operators of $f$ and $g$ instead of the more complex prox-linear operator as in tran2020stochastic (); zhang2020stochastic ().
Note that even if $f$ and $g$ have closed-form proximal operators, the prox-linear operator still does not have a closed-form solution, and requires solving a composite and possibly nonsmooth strongly convex subproblem involving a linear operator, see, e.g., tran2020stochastic (). Moreover, our method can work with both a single sample and a minibatch for the Jacobian, as opposed to the large batch sizes in tran2020stochastic (); zhang2020stochastic ().
Our contribution: Our main contribution in this paper can be summarized as follows:

We develop a new single-loop hybrid variance-reduced SGD algorithm to handle (1) under Assumptions 2.1 and 2.2 below. Under the strong convexity of $g$, our algorithm achieves an $\mathcal{O}(T^{-2/3})$ convergence rate to approximate a KKT (Karush-Kuhn-Tucker) point of (1), where $T$ is the iteration counter and the constants depend on the batch size $b$. We also estimate an $\mathcal{O}(\varepsilon^{-3})$ oracle complexity to obtain an $\varepsilon$-KKT point, matching the best-known one as, e.g., in luo2020stochastic (); zhang2019multi (); zhang2019stochastic (). Our complexity bound holds for a wide range of batch sizes $b$, as opposed to a specific choice as in luo2020stochastic (); zhang2019multi (); zhang2019stochastic (). Moreover, our algorithm has only a single loop, compared to luo2020stochastic (); zhang2019multi ().

When $g$ is nonstrongly convex, we combine our approach with a smoothing technique to develop a gradient-based variant that achieves the best-known Jacobian and function evaluation complexity for finding an $\varepsilon$-KKT point of (1). Moreover, our algorithm does not require prox-linear operators or large batches for the Jacobian as in tran2020stochastic (); zhang2020stochastic ().

We also propose a simple restarting technique without sacrificing convergence guarantees to accelerate the practical performance of both cases (a) and (b) (see Supp. Doc. C).
Our methods exploit the recent biased hybrid estimators introduced in TranDinh2019a (), as opposed to the SARAH ones in tran2020stochastic (); zhang2019multi (); zhang2020stochastic (). This allows us to simplify our algorithm to a single loop without large batches at each iteration, compared to zhang2019multi (). As indicated in arjevani2019lower (), our oracle complexity is optimal under the considered assumptions. If $g$ is nonstrongly convex (i.e., $h$ in (2) can be nonsmooth), then our algorithm is fundamentally different from the ones in tran2020stochastic (); zhang2020stochastic (), as it does not use the prox-linear operator. Note that evaluating a prox-linear operator requires solving a general strongly convex but possibly nonsmooth subproblem. In addition, those methods only work with large batch sizes for both $G$ and its Jacobian $G'$.
2 Basic assumptions, KKT points and smoothing technique
Notation: We work with finite-dimensional spaces equipped with the standard inner product $\langle \cdot, \cdot \rangle$ and Euclidean norm $\| \cdot \|$. For a function $f$, $\mathrm{dom}(f)$ denotes its domain. If $f$ is convex, then $\mathrm{prox}_f$ denotes its proximal operator, $\partial f$ denotes its subdifferential, and $\nabla f$ is its [sub]gradient, see, e.g., Bauschke2011 (). A function $f$ is strongly convex with strong convexity parameter $\mu_f > 0$ if $f(\cdot) - \frac{\mu_f}{2}\|\cdot\|^2$ remains convex. For a smooth vector function $G$, $G'$ denotes its Jacobian. We use $\mathrm{dist}(x, \mathcal{X})$ to denote the Euclidean distance from $x$ to a convex set $\mathcal{X}$.
2.1 Model assumptions
Let $G(x) := \mathbb{E}_{\xi}[G(x, \xi)]$ denote the expectation function of $G(\cdot, \xi)$, and let $\mathrm{dom}(f)$ denote the domain of $f$. Throughout this paper, we always assume that problem (2) is well-defined and $f$ is proper, closed, and convex, without recalling these conditions in the sequel. Our goal is to develop stochastic gradient-based algorithms to solve (1) relying on the following assumptions:
Assumption 2.1.
Note that Assumption 2.1 is standard in stochastic nonconvex optimization, see tran2020stochastic (); zhang2019multi (); zhang2019stochastic (); zhang2020stochastic (). If the underlying domain is bounded, then the required quantity is bounded, and this assumption automatically holds.
For $g$, we only require the following assumption, which is mild and holds for many applications.
Assumption 2.2.
The function $g$ in (1) is proper, closed, and convex. Moreover, its subdifferential $\partial g$ is bounded by $M_g$, i.e., $\|v\| \leq M_g$ for all $v \in \partial g(y)$ and $y \in \mathrm{dom}(g)$.
An important special case of $g$ is the indicator function of a convex and bounded set. Hitherto, we do not require $f$ and $h$ in (2) to be smooth or strongly convex. They can be nonsmooth, so that (2) can also cover constrained problems. Note that the boundedness of $\partial g$ is equivalent to the Lipschitz continuity of $g$ (Lemma A.1). Simple examples of such $g$ include norms and gauge functions.
2.2 KKT points and approximate KKT points
Since (1) is nonconvex-concave, a pair $(x^\star, y^\star)$ is said to be a KKT point of (1) if
(6) $0 \in G'(x^\star)^{\top} K^{\top} y^\star + \partial f(x^\star) \quad \text{and} \quad K G(x^\star) \in \partial g(y^\star).$
From (6), we have $y^\star \in \partial g^{*}(K G(x^\star))$. Substituting this into the first expression, we get
(7) $0 \in G'(x^\star)^{\top} \partial h(G(x^\star)) + \partial f(x^\star).$
Here, we have used $\partial h(u) = K^{\top} \partial g^{*}(K u)$, where $h$ is given by (3). This inclusion shows that $x^\star$ is a stationary point of (2). In the convex-concave case, under mild assumptions, a KKT point is also a saddle-point of (1). In particular, if (2) is convex, then $x^\star$ is also a global optimum of (2).
However, in practice, we can only find an approximation of a KKT point for (1).
Definition 2.1.
Given any tolerance $\varepsilon > 0$, a pair $(\tilde{x}, \tilde{y})$ is called an $\varepsilon$-KKT point of (1) if
(8) $\mathbb{E}\big[\mathrm{dist}\big(0,\, G'(\tilde{x})^{\top} K^{\top} \tilde{y} + \partial f(\tilde{x})\big)\big] \leq \varepsilon \quad \text{and} \quad \mathbb{E}\big[\mathrm{dist}\big(K G(\tilde{x}),\, \partial g(\tilde{y})\big)\big] \leq \varepsilon.$
2.3 Smoothing techniques
Under Assumption 2.2, $h$ defined by (3) can be nonsmooth. Hence, we can smooth $h$ as follows:
(9) $h_\gamma(u) := \max_{y \in \mathbb{R}^q} \big\{ \langle K u, y \rangle - g(y) - \gamma \psi(y) \big\},$
where $\psi$ is a continuously differentiable and strongly convex function such that $\min_y \psi(y) = 0$, and $\gamma > 0$ is a smoothness parameter. For example, we can choose $\psi(y) := \frac{1}{2}\|y - \dot{y}\|^2$ for a fixed $\dot{y}$, or an entropy function defined on a standard simplex Nesterov2005c (). Under Assumption 2.2, $h_\gamma$ possesses some useful properties as stated in Lemma A.1 (Supp. Doc. A.1).
Let $y^{*}_{\gamma}(u)$ be the optimal solution of the maximization problem in (9), which always exists and is unique. In particular, if $\psi(y) := \frac{1}{2}\|y - \dot{y}\|^2$, then
(10) $y^{*}_{\gamma}(u) = \mathrm{prox}_{g/\gamma}\big( \dot{y} + \tfrac{1}{\gamma} K u \big).$
Hence, when $g$ is proximally tractable (i.e., its proximal operator can be computed in closed form or by a low-order polynomial-time algorithm), computing $y^{*}_{\gamma}(u)$ reduces to evaluating the proximal operator of $g$, as opposed to solving a complex subproblem as in prox-linear methods tran2020stochastic (); zhang2020stochastic ().
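As a sanity check of (9)-(10), the following sketch (a hypothetical toy instance with our own helper names) takes $g$ as the indicator of the box $[-1,1]^q$, $\psi(y) = \frac{1}{2}\|y\|^2$, and $\dot{y} = 0$; the proximal operator of an indicator is a projection, here a coordinate-wise clip, so $y^{*}_{\gamma}(u)$ is available in closed form.

```python
def y_star(u, K, gamma):
    """y*_gamma(u) = prox_{g/gamma}(K u / gamma) with g = indicator of [-1,1]^q."""
    Ku = [sum(K[j][i] * u[i] for i in range(len(u))) for j in range(len(K))]
    return [max(-1.0, min(1.0, v / gamma)) for v in Ku]

def smoothed_obj(u, K, gamma, y):
    """Objective of (9): <Ku, y> - gamma/2 * ||y||^2, for y in the box."""
    Ku = [sum(K[j][i] * u[i] for i in range(len(u))) for j in range(len(K))]
    return sum(Ku[j] * y[j] for j in range(len(y))) - 0.5 * gamma * sum(v * v for v in y)

u, K, gamma = [2.0, -0.3], [[1.0, 0.0], [0.0, 1.0]], 0.5
ys = y_star(u, K, gamma)
# the closed-form maximizer dominates any other feasible candidate:
for y in ([0.0, 0.0], [1.0, -1.0], [0.5, -0.5]):
    assert smoothed_obj(u, K, gamma, ys) >= smoothed_obj(u, K, gamma, y)
```

The same one-line projection is what the dual step of the algorithm evaluates at every iteration.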
Given $h_\gamma$ defined by (9), we consider the following smoothed surrogates of $\Psi$ and the objective of (2):
(11) $\Psi_\gamma(x) := h_\gamma(G(x)) \quad \text{and} \quad \Phi_\gamma(x) := f(x) + \Psi_\gamma(x).$
In this case, under Assumptions 2.1 and 2.2, $\Psi_\gamma$ is continuously differentiable, and
(12) $\nabla \Psi_\gamma(x) = G'(x)^{\top} \nabla h_\gamma(G(x)) = G'(x)^{\top} K^{\top} y^{*}_{\gamma}(G(x)).$
Smoothness: Moreover, $\Psi_\gamma$ is $L_{\Psi_\gamma}$-smooth (see zhang2019stochastic ()), i.e.:
(13) $\|\nabla \Psi_\gamma(x) - \nabla \Psi_\gamma(\hat{x})\| \leq L_{\Psi_\gamma} \|x - \hat{x}\|, \quad \forall x, \hat{x},$
where $L_{\Psi_\gamma}$ and the related constants are given in Lemma A.1.
3 The proposed algorithm and its convergence analysis
First, we introduce a stochastic estimator for $\nabla \Psi_\gamma$. Then, we develop our main algorithm and analyze its convergence and oracle complexity. Finally, we show how to construct an $\varepsilon$-KKT point of (1).
3.1 Stochastic estimators and the algorithm
Since $G$ is the expectation of the stochastic function $G(\cdot, \xi)$, we exploit the hybrid stochastic estimators for $G$ and its Jacobian $G'$ introduced in TranDinh2019a (). More precisely, given a sequence $\{x_t\}$ generated by a stochastic algorithm, our hybrid stochastic estimators $u_t$ of $G(x_t)$ and $J_t$ of $G'(x_t)$ are defined as follows:
(15) $u_t := \beta_{t-1}\big( u_{t-1} + G(x_t; \mathcal{B}_t) - G(x_{t-1}; \mathcal{B}_t) \big) + (1 - \beta_{t-1})\, G(x_t; \hat{\mathcal{B}}_t), \quad J_t := \hat{\beta}_{t-1}\big( J_{t-1} + G'(x_t; \mathcal{C}_t) - G'(x_{t-1}; \mathcal{C}_t) \big) + (1 - \hat{\beta}_{t-1})\, G'(x_t; \hat{\mathcal{C}}_t),$
where $\beta_{t-1}, \hat{\beta}_{t-1} \in [0, 1]$ are given weights, and the initial estimators $u_0$ and $J_0$ are defined as
(16) $u_0 := G(x_0; \mathcal{B}_0) \quad \text{and} \quad J_0 := G'(x_0; \mathcal{C}_0).$
Here, $\mathcal{B}_0$, $\mathcal{C}_0$, $\mathcal{B}_t$, $\hat{\mathcal{B}}_t$, $\mathcal{C}_t$, and $\hat{\mathcal{C}}_t$ are minibatches of sizes $b_0$, $c_0$, $b_t$, $\hat{b}_t$, $c_t$, and $\hat{c}_t$, respectively. We allow $\hat{\mathcal{B}}_t$ to be correlated with $\mathcal{B}_t$, and $\hat{\mathcal{C}}_t$ to be correlated with $\mathcal{C}_t$. We also do not require any independence between these minibatches. When $\hat{\mathcal{B}}_t \equiv \mathcal{B}_t$ and $\hat{\mathcal{C}}_t \equiv \mathcal{C}_t$, our estimators reduce to the STORM estimators studied in Cutkosky2019 () as a special case. Clearly, with these choices, we can save function evaluations and Jacobian evaluations at each iteration.
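The recursion (15) can be sketched as follows (schematic Python on an assumed toy map $G(x) = x^2$ with additive Gaussian noise; all names are illustrative, not from the paper). The key point is that the SARAH-type difference term reuses one minibatch $\mathcal{B}_t$ at both $x_t$ and $x_{t-1}$, while the SGD-type term may use another minibatch $\hat{\mathcal{B}}_t$.

```python
import random

def hybrid_update(u_prev, x_prev, x_cur, beta, batch, noise):
    """One step of the hybrid estimator u_t in (15) for a toy G(x) = x**2."""
    # shared minibatch B_t: evaluate BOTH points on the same samples
    B = [random.gauss(0.0, 1.0) for _ in range(batch)]
    g_new = sum(x_cur * x_cur + noise * z for z in B) / batch    # G(x_t; B_t)
    g_old = sum(x_prev * x_prev + noise * z for z in B) / batch  # G(x_{t-1}; B_t)
    # a (possibly independent) minibatch Bhat_t for the plain SGD term
    g_sgd = sum(x_cur * x_cur + noise * random.gauss(0.0, 1.0)
                for _ in range(batch)) / batch                   # G(x_t; Bhat_t)
    # convex combination of SARAH-type correction and SGD estimate
    return beta * (u_prev + g_new - g_old) + (1.0 - beta) * g_sgd

# Noiseless sanity check: if u_{t-1} = G(x_{t-1}), then u_t = G(x_t) exactly.
u = hybrid_update(4.0, 2.0, 3.0, 0.5, 3, 0.0)
assert abs(u - 9.0) < 1e-12
```

Setting `beta = 1.0` recovers a SARAH-type recursion and `beta = 0.0` recovers plain minibatch SGD; the Jacobian estimator $J_t$ follows the same pattern with $G'$ in place of $G$.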
For $u_t$ and $J_t$ defined by (15), we introduce a stochastic estimator for the gradient of $\Psi_\gamma$ in (11) at $x_t$ as follows:
(17) $v_t := J_t^{\top} K^{\top} y^{*}_{\gamma}(u_t).$
To evaluate $v_t$, we need to compute $y^{*}_{\gamma}(u_t)$, which requires just one evaluation of $\mathrm{prox}_{g/\gamma}$ if we use (10). Moreover, due to (16) and (17), evaluating $v_t$ does not require the full matrix $J_t^{\top} K^{\top}$, but only a matrix-vector product, which is often cheaper to form.
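Putting (10), (15), and (17) together, one iteration of the scheme can be sketched on a toy deterministic instance (all problem data here are hypothetical, not from the paper): $G(x) = x$ with Jacobian $1$, $K = 1$, $g$ the indicator of $[-1,1]$, and $f = 0$, so that $\max_{|y| \le 1} xy = |x|$ and the smoothed surrogate is a Huber-type function.

```python
def step(x, gamma, eta):
    u = x                               # exact estimate u_t = G(x_t) in this toy
    y = max(-1.0, min(1.0, u / gamma))  # dual step y*_gamma(u_t) via the prox in (10)
    v = 1.0 * 1.0 * y                   # v_t = J_t^T K^T y*_gamma(u_t) as in (17)
    return x - eta * v                  # prox_{eta f} is the identity since f = 0

x, gamma, eta = 2.0, 0.1, 0.5
for _ in range(50):
    x = step(x, gamma, eta)
assert abs(x) <= gamma  # iterates settle near the minimizer x = 0 of |x|
```

Each iteration costs one function estimate, one Jacobian-vector product, and two proximal operations, which is exactly why the method stays single-loop.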
Algorithm 1 is designed by adopting the idea in TranDinh2019a (), where it can start from two initial batches $\mathcal{B}_0$ and $\mathcal{C}_0$ to generate a good approximation of the search direction before entering the main loop. If diminishing stepsizes are used, it does not require such initial batches. However, it has major differences compared to TranDinh2019a (): the dual step $y^{*}_{\gamma}(u_t)$, the estimator $v_t$, and the dynamic parameter updates. Note that, as explained in (10), since the dual step can be computed via the proximal operator of $g$, Algorithm 1 has a single loop, making it easy to implement in practice compared to methods based on SVRG johnson2013accelerating () and SARAH nguyen2017sarah () such as luo2020stochastic (); zhang2019multi ().
3.2 Convergence analysis of Algorithm 1
Let $\mathcal{F}_t$ be the $\sigma$-field generated by Algorithm 1 up to the $t$-th iteration, which is defined as follows:
(18) $\mathcal{F}_t := \sigma\big( x_0, \mathcal{B}_0, \mathcal{C}_0, \mathcal{B}_1, \hat{\mathcal{B}}_1, \mathcal{C}_1, \hat{\mathcal{C}}_1, \dots, \mathcal{B}_t, \hat{\mathcal{B}}_t, \mathcal{C}_t, \hat{\mathcal{C}}_t \big).$
If $g$ is strongly convex, then, without loss of generality, we can assume that its strong convexity parameter is $\mu_g = 1$. Otherwise, we can rescale it. Moreover, for the sake of our presentation, for a given $\gamma > 0$, we introduce:
(19) 
where the Lipschitz and variance constants are given in Assumption 2.1 and $M_g$ is given in Assumption 2.2. Here, one value of these quantities applies if the minibatch $\hat{\mathcal{B}}_t$ is independent of $\mathcal{B}_t$, and another applies otherwise; similarly for $\hat{\mathcal{C}}_t$ and $\mathcal{C}_t$.
The strongly concave case
Theorem 3.1, whose proof is in Supp. Doc. B.3, analyzes the convergence rate and complexity of Algorithm 1 for the smooth case of $h$ in (2) (i.e., when $g$ is strongly convex).
Theorem 3.1 (Constant stepsize).
Suppose that Assumptions 2.1 and 2.2 hold, $g$ is strongly convex with $\mu_g = 1$, and the corresponding constants are defined in (19). Given a minibatch size $b \geq 1$, choose the weights, stepsizes, and smoothing parameter accordingly. Let $\{x_t\}$ be generated by Algorithm 1 using
(20) 
provided that the resulting stepsize condition holds. Let $\bar{x}_T$ be chosen uniformly at random from $\{x_t\}_{t=0}^{T}$. Then, we have
(21) 
For a given tolerance $\varepsilon > 0$, the total number of iterations to obtain an $\varepsilon$-approximate KKT point is at most $T = \mathcal{O}(\varepsilon^{-3})$. The total numbers of function evaluations and Jacobian evaluations are both at most $\mathcal{O}(\varepsilon^{-3})$.
Theorem 3.2 (Diminishing stepsize).
The nonstrongly concave case
Now, we consider the case where $g$ is nonstrongly convex (or equivalently, (1) is nonstrongly concave in $y$), leading to the nonsmoothness of $h$ in (2). Theorem 3.3 states the convergence of Algorithm 1 in this case; its proof is in Supp. Doc. B.5.
Theorem 3.3 (Constant stepsize).
Assume that Assumptions 2.1 and 2.2 hold, $g$ in (1) is nonstrongly convex, i.e., $h$ is nonsmooth, and the corresponding constants are defined in (19). Let $T$ and $b$ be two positive integers, and let $\{x_t\}$ be generated by Algorithm 1 after $T$ iterations using:
(24) 
Then, with the constants defined in Lemma A.1, the following bound holds:
(25) 
The total number of iterations $T$ to achieve an $\varepsilon$-approximate KKT point, together with the total numbers of function evaluations and Jacobian evaluations, follows from (25) by appropriately choosing the smoothing parameter $\gamma$ and the batch sizes; a suitable choice of these parameters yields the complexity stated in our contribution (b).
Alternatively, we can also establish convergence and estimate the complexity of Algorithm 1 with diminishing stepsize in Theorem 3.4, whose proof is in Supp. Doc. B.6.
Theorem 3.4 (Diminishing stepsize).
3.3 Constructing approximate KKT point for (1) from Algorithm 1
Existing works such as luo2020stochastic (); zhang2019multi (); zhang2020stochastic () do not show how to construct an $\varepsilon$-KKT point of (1) or an $\varepsilon$-stationary point of (2) from their convergence guarantees. Lemma 3.1, whose proof is in Supp. Doc. A.3, shows one way to construct an $\varepsilon$-KKT point of (1) in the sense of Definition 2.1 from the output of Theorems 3.1, 3.2, 3.3, and 3.4.
Lemma 3.1.
Let $\bar{x}$ be computed by Algorithm 1 up to a given accuracy after $T$ iterations. Assume that we can approximate $G(\bar{x})$, its Jacobian $G'(\bar{x})$, and the corresponding dual solution, respectively, such that
(28) 
Let us denote the resulting approximations accordingly and compute $\tilde{y}$ as
(29) 
Suppose that the approximation errors in (28) are controlled by the tolerance $\varepsilon$ for a given constant. Then