An Optimal Multistage Stochastic Gradient Method for Minimax Problems
In this paper, we study the minimax optimization problem in the smooth and strongly convex-strongly concave setting when we only have access to noisy estimates of the gradients. In particular, we first analyze the stochastic Gradient Descent Ascent (GDA) method with constant stepsize and show that it converges to a neighborhood of the solution of the minimax problem. We further provide tight bounds on the convergence rate and the size of this neighborhood. Next, we propose a multistage variant of stochastic GDA (M-GDA) that runs in multiple stages with a particular learning-rate decay schedule and converges to the exact solution of the minimax problem. We show that M-GDA achieves the lower bound in terms of noise dependence without any assumptions on the knowledge of the noise characteristics. We also show that the error of M-GDA decays linearly in its dependence on the initial error, although its dependence on the condition number is suboptimal. To improve this dependence, we apply the multistage machinery to the stochastic Optimistic Gradient Descent Ascent (OGDA) algorithm and propose the M-OGDA algorithm, which also achieves the optimal linear decay rate with respect to the initial error. To the best of our knowledge, this method is the first to simultaneously achieve the best dependence on the noise characteristics as well as on the initial error and condition number.
1 Introduction

The minimax optimization problem has recently gained tremendous attention as the canonical problem formulation for robust training of machine learning models and Generative Adversarial Networks (GANs) (see Madry et al. (2018); Goodfellow et al. (2014); Arjovsky et al. (2017)). While many papers have studied the convergence of a broad range of algorithms in the deterministic setting, i.e., when the gradient information is exact, many aspects of this problem in the stochastic setting are yet to be explored. This is the main goal of our manuscript as we provide a framework for analyzing minimax optimization algorithms which can be used for both the deterministic and stochastic settings.
We consider the minimax problem
$$\min_{x \in \mathbb{R}^m} \max_{y \in \mathbb{R}^n} f(x, y), \qquad (1)$$
where $f: \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R}$ is $L$-smooth and $\mu$-strongly convex-strongly concave (see Section 2 for the precise statement of our assumptions). The condition number of the problem is defined as $\kappa = L/\mu$. Due to the strong convexity-strong concavity of the function $f$, this problem has a unique saddle point which we denote by $(x^*, y^*)$, i.e.,
$$f(x^*, y) \le f(x^*, y^*) \le f(x, y^*) \quad \text{for all } x \in \mathbb{R}^m,\ y \in \mathbb{R}^n.$$
In this paper, our main focus is on the case when the exact gradient information is not available and we only have access to an unbiased estimate through a stochastic oracle.
More formally, we assume that at each iterate we have access to stochastic gradients $\tilde{\nabla}_x f$ and $\tilde{\nabla}_y f$ satisfying
$$\mathbb{E}[\tilde{\nabla}_x f] = \nabla_x f, \qquad \mathbb{E}[\tilde{\nabla}_y f] = \nabla_y f$$
(see Assumption 2.2 for the precise statement).
This setting arises in many applications, including the training of GANs, where the generator and the discriminator approximate the gradient by drawing a batch of data points and averaging the gradients computed on each individual data point. It is worth noting that the inexact-gradient issue also appears in other scenarios, such as privacy-related applications, where noise is added intentionally to prevent the model from memorizing possibly sensitive data and to preserve privacy Xie et al. (2018), or settings where the presence of noise is inevitable due to imperfections in communication and sensing.
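For instance, a minibatch gradient estimator of this kind can be sketched as follows (a minimal sketch: the name `minibatch_grad`, the callback `grad_single`, and the uniform-sampling scheme are illustrative assumptions, not a prescribed implementation):

```python
import numpy as np

def minibatch_grad(grad_single, data, z, batch_size, rng):
    """Unbiased minibatch estimate of a full gradient: average the
    per-sample gradients over a uniformly drawn batch. The estimator is
    unbiased, and its variance shrinks as 1/batch_size."""
    idx = rng.choice(len(data), size=batch_size, replace=False)
    return np.mean([grad_single(z, data[i]) for i in idx], axis=0)
```

The averaged estimate plays the role of $\tilde{\nabla} f$ above: its mean is the full-data gradient, and the residual sampling noise is exactly the stochasticity the analysis must absorb.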
In solving the minimization problem in the stochastic setting, it is well known that, for many algorithms, the squared distance of the iterates to the solution of the minimization problem can be bounded by the sum of two terms: bias and variance Bach and Moulines (2013); Ghadimi and Lan (2012); Aybat et al. (2019). The bias term captures the effect of the initialization, expressed in terms of the distance of the initial point to the solution, and is independent of the noise parameters. The variance term depends on the noise characteristics ($\sigma^2$ in our case) and is independent of the initialization error. For the minimization problem with a strongly convex objective function, and in the noiseless case (with only the bias term), Nemirovsky and Yudin (1983) have shown a lower bound implying that $\Omega(\sqrt{\kappa}\log(1/\epsilon))$ iterations are needed for the distance of the $k$-th iterate to the optimal solution to fall below $\epsilon$. With noise, Raginsky and Rakhlin (2011) have shown that the lower bound increases by an additive $\Omega(\sigma^2/(\mu^2 k))$ term. Several papers have highlighted the trade-off between bias and variance which arises in the design of optimization algorithms Aybat et al. (2018) and tried to achieve both lower bounds simultaneously Ghadimi and Lan (2013); Aybat et al. (2019).
In this paper, we highlight this bias-variance decomposition in evaluating the performance of algorithms that solve the minimax problem. For the bias term, i.e., the deterministic case, Ibrahim et al. (2019) have recently shown a lower bound of $\Omega(\kappa\log(1/\epsilon))$ iterations, highlighting that the dependence on the condition number increases from $\sqrt{\kappa}$ in minimization problems to $\kappa$ for minimax problems. For the variance term, since the minimization problem is a special case of the minimax problem, the lower bound for the minimization problem is also valid for the minimax problem. While this lower bound for the variance term has been attained Hsieh et al. (2019); Rosasco et al. (2014) at the cost of making the bias term sublinear, the question of whether a linear rate in the bias term and the optimal $\mathcal{O}(\sigma^2/(\mu^2 k))$ rate in the variance term could be achieved simultaneously has not been addressed prior to this work.
In what follows, we first provide a summary of related works and then discuss the main contributions of our paper.
1.1 Related Work
Many papers have studied the minimax problem when the exact gradient information is available. In the case of the Gradient Descent Ascent (GDA) method, Du and Hu (2019) analyze its performance for the special case of bilinear coupling, i.e., $f(x, y) = g(x) + y^\top A x - h(y)$, where $g$ is smooth and convex, $h$ is smooth and strongly convex, and the matrix $A$ has full column rank. They show that the GDA algorithm converges linearly on this problem, i.e., it reaches a point which is $\epsilon$-close to the saddle point after $\mathcal{O}(\log(1/\epsilon))$ steps. In addition, when the function $g$ is assumed to be strongly convex, GDA reaches a point which is $\epsilon$-close to the saddle point after $\mathcal{O}(\kappa^2\log(1/\epsilon))$ steps. Liang and Stokes (2019) extend this result to a general function $f$ which is strongly convex in $x$ and strongly concave in $y$ (achieving the same rate of convergence as Du and Hu (2019)). Several other gradient-based algorithms, such as the Optimistic Gradient Descent Ascent (OGDA) method (see Daskalakis et al. (2018)) and the Extragradient method Korpelevich (1976), have been analyzed in recent papers including Mokhtari et al. (2019a); Liang and Stokes (2019); Gidel et al. (2019); Mokhtari et al. (2019b); Hsieh et al. (2019). These papers analyze the algorithms in several settings, including bilinear, strongly convex-strongly concave, and convex-concave. More specifically, Gidel et al. (2019); Mokhtari et al. (2019a) show that when the objective function is strongly convex-strongly concave, running the OGDA and Extragradient algorithms for $\mathcal{O}(\kappa\log(1/\epsilon))$ steps reaches a point which is $\epsilon$-close to the saddle point.
The papers closest to our results are Rosasco et al. (2014) and Hsieh et al. (2019). Rosasco et al. (2014) propose a forward-backward splitting algorithm to solve the stochastic minimax problem (they solve the more general problem of monotone inclusions). When the function is strongly convex-strongly concave, they show convergence to the saddle point at a rate that is sublinear in both the bias and variance terms. Hsieh et al. (2019) show that the stochastic version of OGDA converges to the saddle point at a rate of $\mathcal{O}(1/k)$ for both bias and variance when the objective function is strongly convex-strongly concave.
There are several papers which analyze the stochastic minimax problem when the objective function is convex-concave. Juditsky et al. (2011) propose the stochastic mirror-prox algorithm (a special case of which is the stochastic extragradient method) to solve the convex-concave saddle point problem with noisy gradients. They assume the constraint set is compact and show a convergence rate of $\mathcal{O}(1/\sqrt{k})$ (this result improves on the robust stochastic approximation algorithm proposed in Nemirovski et al. (2009)). Chen et al. (2014) propose an accelerated primal-dual algorithm whose noise term also decays at a rate of $\mathcal{O}(1/\sqrt{k})$. Recently, Mertikopoulos et al. (2018) analyzed the stochastic extragradient algorithm for coherent minimax problems (a condition slightly weaker than the convex-concave assumption) and showed asymptotic convergence to a saddle point. Gidel et al. (2019) analyzed a single-call version of extragradient (which corresponds to OGDA) when the function is convex-concave and showed that, in the stochastic setting, this algorithm converges to the saddle point at a rate of $\mathcal{O}(1/\sqrt{k})$.
Another line of work is the case where the objective function has a finite sum structure and the gradient of the entire function cannot be computed at each step. Several papers including Bot et al. (2019); Palaniappan and Bach (2016); Chavdarova et al. (2019); Iusem et al. (2017) analyze this setting and apply variance reduction techniques (like SVRG and SAGA) to improve convergence rates to the saddle point.
1.2 Our Contribution
We first analyze GDA with constant stepsize (learning rate), building our analysis by casting it as a dynamical system, an approach that has recently gained attention in the optimization and machine learning literature Lessard et al. (2016); Hu and Lessard (2017); Aybat et al. (2018, 2019). In particular, we show that GDA with any stepsize $\eta$ in the admissible range converges to an $\mathcal{O}(\eta\sigma^2/\mu)$ neighborhood of the optimal solution at a linear rate. Next, we propose a novel Multistage Stochastic Gradient Descent Ascent scheme (inspired by Aybat et al. (2019)) which achieves a rate of $\mathcal{O}(\sigma^2/(\mu^2 T))$ for the variance term (which is optimal in its dependence on $T$ and $\sigma^2$) and a rate of $\mathcal{O}(\exp(-T/\kappa^2))$ for the bias term, and we show that the $\kappa^2$ and $\sigma^2$ dependence of these terms cannot be improved for GDA dynamics.
Next, we focus on the OGDA method, which has gained widespread attention for solving minimax problems. We first highlight that stochastic OGDA also converges to an $\mathcal{O}(\eta\sigma^2/\mu)$ neighborhood of the optimal solution at a linear rate, but allows for a broader range, of order $1/L$, for the stepsize. Then, we introduce the Multistage version of Stochastic Optimistic Gradient Descent Ascent (M-OGDA), which achieves the rate of $\mathcal{O}(\sigma^2/(\mu^2 T))$ for the variance term and a rate of $\mathcal{O}(\exp(-T/\kappa))$ for the bias term, which improves on the decay of the bias term of GDA and matches the lower bound shown in Ibrahim et al. (2019).
2 Preliminaries

We denote the identity and zero matrices by $I$ and $0$, respectively. Throughout this paper, all vectors are represented as column vectors. The superscript $\top$ represents the transpose of a vector or a matrix. For two matrices $A$ and $B$, their Kronecker product is represented by $A \otimes B$. Also, $A \succeq 0$ implies that $A$ is a symmetric and positive semidefinite matrix.
We first formally state the strong convexity (concavity) and smoothness properties of a function.
A convex function $g : \mathbb{R}^n \to \mathbb{R}$ is $L$-smooth and $\mu$-strongly convex if it satisfies the following two conditions for all $x, \hat{x} \in \mathbb{R}^n$:
$$\|\nabla g(x) - \nabla g(\hat{x})\| \le L\,\|x - \hat{x}\|,$$
$$g(\hat{x}) \ge g(x) + \nabla g(x)^\top(\hat{x} - x) + \frac{\mu}{2}\|\hat{x} - x\|^2.$$
Further, $g$ is $\mu$-strongly concave if $-g$ is $\mu$-strongly convex.
For an $L$-smooth convex function $g$, we have the following characterization (see Theorem 2.1.5 in Nesterov (2004)):
$$\langle \nabla g(x) - \nabla g(\hat{x}),\, x - \hat{x} \rangle \ge \frac{1}{L}\,\|\nabla g(x) - \nabla g(\hat{x})\|^2.$$
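As a quick numerical sanity check of this characterization, the cocoercivity inequality can be verified on a convex quadratic (the matrix and the test points below are arbitrary illustrative choices, not taken from the paper):

```python
import numpy as np

# Cocoercivity of an L-smooth convex quadratic g(x) = 0.5 * x^T A x:
#   <grad g(x) - grad g(y), x - y> >= (1/L) * ||grad g(x) - grad g(y)||^2,
# where L is the largest eigenvalue of the (PSD) Hessian A.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
L = max(np.linalg.eigvalsh(A))
grad = lambda x: A @ x

x, y = np.array([1.0, -1.0]), np.array([0.5, 2.0])
lhs = (grad(x) - grad(y)) @ (x - y)
rhs = np.linalg.norm(grad(x) - grad(y)) ** 2 / L
assert lhs >= rhs  # Theorem 2.1.5 of Nesterov (2004)
```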
Throughout the paper, we assume the following:
We assume at iterate $(x_k, y_k)$, we have access to $\tilde{\nabla}_x f(x_k, y_k)$ and $\tilde{\nabla}_y f(x_k, y_k)$, which are unbiased estimates of $\nabla_x f(x_k, y_k)$ and $\nabla_y f(x_k, y_k)$, respectively, i.e.,
$$\mathbb{E}[\tilde{\nabla}_x f(x_k, y_k)] = \nabla_x f(x_k, y_k), \qquad \mathbb{E}[\tilde{\nabla}_y f(x_k, y_k)] = \nabla_y f(x_k, y_k).$$
In addition, we assume the noises in $\tilde{\nabla}_x f(x_k, y_k)$ and $\tilde{\nabla}_y f(x_k, y_k)$ are independent of each other and of the previous iterates. Moreover, we assume their variances are bounded:
$$\mathbb{E}\big\|\tilde{\nabla}_x f(x_k, y_k) - \nabla_x f(x_k, y_k)\big\|^2 \le \sigma^2, \qquad \mathbb{E}\big\|\tilde{\nabla}_y f(x_k, y_k) - \nabla_y f(x_k, y_k)\big\|^2 \le \sigma^2.$$
To simplify the notation, we suppress the $x_k$ and $y_k$ dependence of the stochastic gradients throughout the paper.
The function $f$ is continuously differentiable in $x$ and $y$. For any $y$, $f(\cdot, y)$ is $L_x$-smooth and $\mu_x$-strongly convex as a function of $x$. Similarly, for any $x$, $f(x, \cdot)$ is $L_y$-smooth and $\mu_y$-strongly concave as a function of $y$.
In addition, the gradient $\nabla_x f(x, \cdot)$ is $L_{xy}$-Lipschitz in $y$, i.e.,
$$\|\nabla_x f(x, y_1) - \nabla_x f(x, y_2)\| \le L_{xy}\,\|y_1 - y_2\|.$$
Similarly, the gradient $\nabla_y f(\cdot, y)$ is $L_{xy}$-Lipschitz in $x$, i.e.,
$$\|\nabla_y f(x_1, y) - \nabla_y f(x_2, y)\| \le L_{xy}\,\|x_1 - x_2\|.$$
Note that this assumption implies that the saddle point $(x^*, y^*)$ is unique and, in addition, we have $\nabla_x f(x^*, y^*) = 0$ and $\nabla_y f(x^*, y^*) = 0$.
Under Assumption 2.3, we call the function $f$ $L$-smooth and $\mu$-strongly convex-strongly concave, where $L = \max\{L_x, L_y, L_{xy}\}$ and $\mu = \min\{\mu_x, \mu_y\}$. We define the condition number of the problem as $\kappa = L/\mu$.
We next present some key properties of smooth strongly convex-strongly concave functions that will be used in our analysis. Define the operator
$$F(z) = \begin{bmatrix} \nabla_x f(x, y) \\ -\nabla_y f(x, y) \end{bmatrix},$$
where $z = [x; y]$. For a stepsize $\eta > 0$, we define the GDA map
$$T_\eta(z) = z - \eta F(z).$$
Also, we define $z^* = (x^*, y^*)$ as the unique saddle point. The following lemma, which establishes strong monotonicity and Lipschitz continuity of $F$, follows from the strong convexity and smoothness properties of $f$.
See Appendix A. ∎
Using Lemma 2.5, we can prove the following result.
See Appendix B. ∎
3 Analysis of Stochastic Gradient Descent Ascent Method
In this section, we study the Stochastic Gradient Descent Ascent (GDA) algorithm, whose iterates are given by:
$$x_{k+1} = x_k - \eta\,\tilde{\nabla}_x f(x_k, y_k), \qquad y_{k+1} = y_k + \eta\,\tilde{\nabla}_y f(x_k, y_k).$$
Writing $z_k = [x_k; y_k]$ and $\tilde{F}(z_k) = [\tilde{\nabla}_x f(x_k, y_k); -\tilde{\nabla}_y f(x_k, y_k)]$, this can be succinctly written as:
$$z_{k+1} = z_k - \eta\,\tilde{F}(z_k).$$
Using this notation, we can represent GDA as a dynamical system as follows:
$$z_{k+1} = z_k - \eta\,\big(F(z_k) + w_k\big),$$
where $w_k = \tilde{F}(z_k) - F(z_k)$ denotes the gradient noise at step $k$. We study the convergence properties of the sequence $\{z_k\}$ through the evolution of the Lyapunov function $V_P(z) = (z - z^*)^\top P (z - z^*)$, where $P = \rho I$ with $\rho > 0$ an arbitrary constant. In particular, in the following lemma, we first bound the expected change of $V_P$ between consecutive iterates. We skip the proof as it is very similar to the proof of Lemma B.1 in Aybat et al. (2019).
Let $P = \rho I$ with $\rho > 0$ and consider the function $V_P(z) = (z - z^*)^\top P (z - z^*)$. Then we have
Next, using this lemma, we characterize the convergence of GDA.
See Appendix C. ∎
It is worth noting that the range for the stepsize $\eta$ in Theorem 3.2 is upper bounded by $\mathcal{O}(\mu/L^2)$ (as opposed to just a function of the Lipschitz parameter $L$, as is the case for Gradient Descent in minimization problems). This is consistent with the fact that GDA may diverge when the strong convexity parameter is 0, i.e., the function is convex-concave (see the Bilinear example in Daskalakis et al. (2018)).
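As a concrete illustration of the recursion analyzed above, here is a minimal sketch of stochastic GDA; the additive Gaussian noise model and the decoupled toy objective are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def stochastic_gda(grad_x, grad_y, x, y, eta, sigma, steps, rng):
    """Stochastic GDA: descend in x, ascend in y; additive Gaussian noise
    stands in for the stochastic gradient oracle."""
    for _ in range(steps):
        gx = grad_x(x, y) + sigma * rng.standard_normal(x.shape)
        gy = grad_y(x, y) + sigma * rng.standard_normal(y.shape)
        x, y = x - eta * gx, y + eta * gy
    return x, y

# Toy objective f(x, y) = 0.5*x^2 - 0.5*y^2 (saddle point at the origin).
gx = lambda x, y: x
gy = lambda x, y: -y
```

With `sigma = 0` the iterates contract geometrically toward the saddle point; with `sigma > 0` they settle in a noise-dominated neighborhood whose size grows with `eta`, in line with the constant-stepsize behavior described by Theorem 3.2.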
3.1 Tightness of the Results
In this subsection, we give an example of a function for which GDA requires $\Omega(\kappa^2\log(1/\epsilon))$ iterations to reach a point which is $\epsilon$-close to the saddle point. Consider the function
$$f(x, y) = \frac{\mu}{2}x^2 + \lambda xy - \frac{\mu}{2}y^2, \qquad (18)$$
where $\mu > 0$ and $\lambda = \sqrt{L^2 - \mu^2}$. The condition number of this function is $\kappa = L/\mu$ and the saddle point of this function is $(0, 0)$.
Let $\{z_k\}$ be the iterates generated by GDA for the objective function given in Equation (18). Then,
(i) if the gradient at each step is exactly available (i.e., the updates reduce to the deterministic GDA updates), we have:
$$\|z_k - z^*\|^2 \ge \left(1 - \frac{1}{\kappa^2}\right)^{k} \|z_0 - z^*\|^2;$$
(ii) if at each step the gradients are corrupted by additive i.i.d. noise with distribution $\mathcal{N}(0, \sigma^2 I)$, we have
$$\lim_{k \to \infty} \mathbb{E}\|z_k - z^*\|^2 = \Omega\!\left(\frac{\eta\sigma^2}{\mu}\right).$$
See Appendix D. ∎
Example 3.4(i) shows that in order to find the saddle point of the function defined in Equation (18), we need to run at least $\Omega(\kappa^2\log(1/\epsilon))$ steps of GDA (i.e., in the deterministic case) to reach a point which is $\epsilon$-close to the solution, showing that this dependence on $\kappa$ cannot be improved. Example 3.4(ii) shows that when the gradients are corrupted by noise with variance $\sigma^2$, GDA reaches an $\Theta(\eta\sigma^2/\mu)$ neighborhood of the saddle point, and this dependence on $\sigma^2$ cannot be improved.
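The stepsize restriction behind this $\kappa^2$ behavior can be checked numerically. The quadratic below, $f(x, y) = \frac{\mu}{2}x^2 + \lambda xy - \frac{\mu}{2}y^2$, is an illustrative stand-in consistent with the structure of the example above; since deterministic GDA on a quadratic is a linear map, its convergence is governed by a spectral radius:

```python
import numpy as np

# GDA on f(x, y) = (mu/2)x^2 + lam*x*y - (mu/2)y^2 is the linear map
#   z_{k+1} = M z_k,  M = [[1 - eta*mu, -eta*lam], [eta*lam, 1 - eta*mu]],
# whose spectral radius is sqrt((1 - eta*mu)^2 + (eta*lam)^2).
def gda_spectral_radius(mu, lam, eta):
    M = np.array([[1 - eta * mu, -eta * lam],
                  [eta * lam, 1 - eta * mu]])
    return max(abs(np.linalg.eigvals(M)))

mu, lam = 0.1, 1.0                                   # heavy bilinear coupling
assert gda_spectral_radius(mu, lam, 1.0 / lam) > 1   # eta ~ 1/L diverges
assert gda_spectral_radius(mu, lam, mu / lam**2) < 1 # eta ~ mu/L^2 contracts
```

Minimizing the radius over `eta` gives a contraction factor of roughly $1 - 1/(2\kappa^2)$ per step, which is why of order $\kappa^2\log(1/\epsilon)$ iterations are needed.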
4 A Multistage Stochastic Gradient Descent Ascent Method (M-GDA)
Our result in Theorem 3.2 shows that for GDA with constant stepsize $\eta$, the iterates converge to an $\mathcal{O}(\eta\sigma^2/\mu)$ neighborhood of the saddle point. In this section, we introduce a new method, a variant of GDA with progressively decreasing stepsize, that converges to the exact unique saddle point of problem (1). Our proposed algorithm, Multistage Stochastic Gradient Descent Ascent (M-GDA), presented in Algorithm 1, runs in several stages, where each stage is the GDA method with a constant stepsize. In what follows, we show that our multistage method, with a carefully chosen evolution of stepsizes and stage lengths, achieves linear decay in the bias term as well as optimal variance dependence without any knowledge of the noise properties.
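The stage structure can be sketched as follows. The halve-the-stepsize/double-the-length schedule below is an illustrative assumption standing in for the exact schedule specified in Algorithm 1, and `grad_x`/`grad_y` are hypothetical gradient callbacks:

```python
import numpy as np

def multistage_gda(grad_x, grad_y, x, y, eta0, T0, num_stages, sigma, rng):
    """Multistage sketch: each stage runs constant-stepsize stochastic GDA;
    between stages the stepsize is halved and the stage length doubled
    (illustrative schedule; the exact one is given in Algorithm 1)."""
    eta, T = eta0, T0
    for _ in range(num_stages):
        for _ in range(T):
            gx = grad_x(x, y) + sigma * rng.standard_normal(x.shape)
            gy = grad_y(x, y) + sigma * rng.standard_normal(y.shape)
            x, y = x - eta * gx, y + eta * gy
        eta, T = eta / 2, 2 * T
    return x, y
```

The design intuition: each stage contracts the bias toward the current noise floor, and shrinking the stepsize lowers that floor for the next stage, so no knowledge of $\sigma^2$ is required.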
We prove the result by induction on . To simplify the notation, we define .
First, for the base case, note that, by using Theorem 3.2 along with the fact that , we have:
where we plugged in to obtain the last equality. Hence, the result holds for . Now, assume the result holds for , and we show it for . Note that, Theorem 3.2 for stage yields:
where we used and to derive the last inequality. Now, note that, by induction hypothesis, we have
Substituting this bound in (23), we obtain
where the last bound follows from . This completes the proof. ∎
The above theorem provides an upper bound on the distance of the last iterate of each stage to the saddle point of problem (1). Using this result, in the following corollary, we provide an upper bound on the distance of any iterate from the saddle point. Before stating this corollary, let $\{z_t\}$ be the sequence obtained by concatenating the sequences of all stages, i.e.,
See Appendix E. ∎
We interpret this result in two different regimes. First, we consider the case where we are given a fixed budget of $T$ iterations. In this case, the following corollary shows how we can tune the parameters to obtain linear decay in the bias term as well as an $\mathcal{O}(1/T)$ reduction in the variance term. We omit the proof as it is an immediate application of Corollary 4.2.
Finally, in the following corollary, we illustrate how our results can be applied to the case where we do not know the number of iterations in advance.
It is worth noting that the results in Corollaries 4.3 and 4.4 are presented in terms of the iterate which is obtained by concatenating the iterates of all stages, including inner iterations (as given in (28)). In fact, while it is true that the number of inner stage iterations increases, the bounds in Table 1 and Corollaries 4.3 and 4.4, are all based on the total number of iterations, and therefore, they take into account the inner stage iterations.
5 A Multistage Stochastic Optimistic Gradient Descent Ascent Method (M-OGDA)
As we showed in the previous section, M-GDA achieves the optimal variance rate as well as linear decay in the bias term. However, the dependence of the latter on the condition number is suboptimal compared to the lower bound presented in Ibrahim et al. (2019). A natural question, therefore, is whether we can design an algorithm that matches the lower bound for the bias term while simultaneously enjoying the optimal variance decay. In this section, we show that this is possible by applying the multistage machinery to the stochastic Optimistic Gradient Descent Ascent (OGDA) algorithm. We first revisit the existing results on the convergence of the stochastic OGDA method, and then show how its multistage version (M-OGDA) matches both lower bounds simultaneously.
The stochastic OGDA method is given by:
$$x_{k+1} = x_k - 2\eta\,\tilde{\nabla}_x f(x_k, y_k) + \eta\,\tilde{\nabla}_x f(x_{k-1}, y_{k-1}),$$
$$y_{k+1} = y_k + 2\eta\,\tilde{\nabla}_y f(x_k, y_k) - \eta\,\tilde{\nabla}_y f(x_{k-1}, y_{k-1}),$$
which can also be written as:
$$z_{k+1} = z_k - 2\eta\,\tilde{F}(z_k) + \eta\,\tilde{F}(z_{k-1}),$$
where we recall that $z_k = [x_k; y_k]$, $\eta$ is the stepsize, and $\tilde{F}$ is formed from the stochastic gradients (unbiased estimates of the true gradients) of (13).
The OGDA updates have been observed to perform well empirically for training GANs (see Daskalakis et al. (2018)) and have been proved to converge for convex-concave problems (see Mokhtari et al. (2019a); Hsieh et al. (2019)), which is not true for GDA.
As shown in Gidel et al. (2019); Hsieh et al. (2019), the OGDA updates can also be viewed as a single-call version of the Extragradient method (since it reuses a gradient from the past), and using this interpretation, the OGDA algorithm can also be written as follows:
$$z_{k+1/2} = z_k - \eta\,\tilde{F}(z_{k-1/2}), \qquad z_{k+1} = z_k - \eta\,\tilde{F}(z_{k+1/2}).$$
Note that the difference from Extragradient (EG) is that in EG, the update for $z_{k+1/2}$ involves the gradient at $z_k$, whereas here we use the gradient at $z_{k-1/2}$ instead. We will use this form of the stochastic OGDA updates in our analysis, which yields the following result.
This is similar to Theorem 3.2 for GDA. However, we can see that for OGDA, the range of permissible stepsizes goes all the way up to $\mathcal{O}(1/L)$, whereas for GDA, the stepsizes are upper bounded by $\mathcal{O}(\mu/L^2)$.
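A minimal sketch of the stochastic OGDA recursion $z_{k+1} = z_k - \eta(2\tilde{F}(z_k) - \tilde{F}(z_{k-1}))$; the additive Gaussian noise model and the toy operator are illustrative assumptions:

```python
import numpy as np

def stochastic_ogda(F, z0, eta, sigma, steps, rng):
    """Stochastic OGDA: z_{k+1} = z_k - eta * (2*g_k - g_{k-1}), where g_k
    is a noisy evaluation of the operator F at z_k, reused once at the
    next step (the "single call" property)."""
    z = z0
    g_prev = F(z) + sigma * rng.standard_normal(z.shape)
    for _ in range(steps):
        g = F(z) + sigma * rng.standard_normal(z.shape)
        z = z - eta * (2 * g - g_prev)
        g_prev = g
    return z

# Toy operator for f(x, y) = 0.5*x^2 + x*y - 0.5*y^2:
#   F(z) = (grad_x f, -grad_y f) = (x + y, y - x), saddle point at the origin.
F = lambda z: np.array([z[0] + z[1], z[1] - z[0]])
```

Note that only one fresh operator evaluation is made per iteration, in contrast to the two evaluations of Extragradient.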
The result in Theorem 5.1 shows that for OGDA with constant stepsize $\eta$, the iterates converge to an $\mathcal{O}(\eta\sigma^2/\mu)$ neighborhood of the saddle point. Next, we analyze a multistage version of OGDA (M-OGDA), similar to the analysis of M-GDA in Section 4. We show that the iterates of M-OGDA converge to the unique saddle point at a rate where the variance decays as $\mathcal{O}(\sigma^2/(\mu^2 T))$, which is optimal (and also achieved by M-GDA), while the bias term decays as $\mathcal{O}(\exp(-T/\kappa))$, which "accelerates" M-GDA in terms of its dependence on $\kappa$. More formally, we state the following theorem, which is analogous to Theorem 4.1 for M-GDA and presents the convergence rate of M-OGDA (we omit the proof as it is very similar to that of Theorem 4.1):
Similar to the discussion in Section 4, we next state how our result leads to bounds on the distance of each iterate to the saddle point of problem (1). In addition, we propose proper choices of parameters, both in general and in the case where the iteration budget is known in advance; these results correspond to Corollaries 4.2 and 4.4 for M-GDA. Before stating this corollary, we define $\{z_t\}$ to be the sequence obtained by concatenating the sequences of all stages, i.e.,
Suppose that the conditions in Assumptions 2.2 and 2.3 are satisfied. Let $\{z_t\}$ be the iterates generated by M-OGDA (Algorithm 2) with the parameters given in Theorem 4.1. Also, recall the definition of the concatenated sequence from (35). Then, for any $t$, we have
In particular, suppose we choose . Then, for any , we have
Also, when the number of iterations is known in advance, choosing with , implies
for any .
Once again, we would like to highlight that the results in Corollary 5.3 are presented in terms of the iterate which is obtained by concatenating the iterates of all stages, including inner iterations (as given in (35)). As a result, in comparing our results to other methods in Table 1, we take into account the inner stage iterations.
6 Conclusion

In this paper, we proposed multistage versions of the Gradient Descent Ascent (GDA) and Optimistic Gradient Descent Ascent (OGDA) algorithms to solve stochastic minimax problems. In particular, these algorithms are the first to simultaneously achieve a linear rate in the bias term and the optimal rate in the variance term. We also showed that Multistage OGDA improves the bias rate of Multistage GDA from $\mathcal{O}(\exp(-T/\kappa^2))$ to $\mathcal{O}(\exp(-T/\kappa))$, which matches the best known rate in deterministic minimax optimization.