# Stochastic Variance Reduced Primal Dual Algorithms for Empirical Composition Optimization

Adithya M. Devraj (Department of Electrical and Computer Engineering, University of Florida, Gainesville, USA; adithyamdevraj@ufl.edu; the work was done during an internship at Tencent AI Lab, Bellevue, WA) and Jianshu Chen (Tencent AI Lab, Bellevue, WA, USA; jianshuchen@tencent.com)
###### Abstract

We consider a generic empirical composition optimization problem, where there are empirical averages present both outside and inside nonlinear loss functions. Such a problem is of interest in various machine learning applications, and cannot be directly solved by standard methods such as stochastic gradient descent (SGD). We take a novel approach to solving this problem by reformulating the original minimization objective into an equivalent min-max objective, which brings out all the empirical averages that are originally inside the nonlinear loss functions. We exploit the rich structures of the reformulated problem and develop a stochastic primal-dual algorithm, SVRPDA-I, to solve the problem efficiently. We carry out extensive theoretical analysis of the proposed algorithm, obtaining the convergence rate, the total computation complexity and the storage complexity. In particular, the algorithm is shown to converge at a linear rate when the problem is strongly convex. Moreover, we also develop an approximate version of the algorithm, SVRPDA-II, which further reduces the memory requirement. Finally, we evaluate the performance of our algorithms on several real-world benchmarks, and experimental results show that the proposed algorithms significantly outperform existing techniques.


Preprint. Under review.

## 1 Introduction

In this paper, we consider the following regularized empirical composition optimization problem:

 min_θ (1/n_X) ∑_{i=0}^{n_X−1} ϕ_i( (1/n_{Y_i}) ∑_{j=0}^{n_{Y_i}−1} f_θ(x_i, y_{ij}) ) + g(θ),   (1)

where (x_i, y_{ij}) is the (i, j)-th data sample, f_θ is a function parameterized by θ, ϕ_i is a convex merit function, which measures a certain loss of the parametric function f_θ, and g(θ) is a μ-strongly convex regularization term.

Problems of the form (1) widely appear in many machine learning applications such as reinforcement learning [5, 3, 2, 13], unsupervised sequence classification [11, 21] and risk-averse learning [15, 18, 8, 9, 19] — see our detailed discussion in Section 2. Note that the cost function (1) has an empirical average (over i) outside the (nonlinear) merit function ϕ_i and an empirical average (over j) inside the merit function, which makes it different from the empirical risk minimization problems that are common in machine learning [17]. Problem (1) can be understood as a generalized version of the one considered in [8, 9] (in addition to the term in (2), the cost function in [9] also has another convex regularization term). In these prior works, n_{Y_i} and y_{ij} are assumed to be independent of i, and f_θ is a function of y_j only, so that problem (1) reduces to the following special case:

 min_θ (1/n_X) ∑_{i=0}^{n_X−1} ϕ_i( (1/n_Y) ∑_{j=0}^{n_Y−1} f_θ(y_j) ).   (2)

Our more general problem formulation (1) encompasses wider applications (see Section 2). Furthermore, different from [2, 19, 18], we focus on the finite-sample setting, where we have empirical averages (instead of expectations) in (1). As we shall see below, the finite-sum structure allows us to develop efficient stochastic gradient methods that converge at a linear rate.

While problem (1) is important in many machine learning applications, there are several key challenges in solving it efficiently. First, the number of samples (i.e., n_X and n_{Y_i}) could be extremely large: larger than one million or even one billion. It is therefore unrealistic to use a batch gradient descent algorithm, which requires going over all the data samples at each gradient update step. Moreover, since there is an empirical average inside the nonlinear merit function ϕ_i, it is not possible to directly apply the classical stochastic gradient descent (SGD) algorithm. This is because sampling simultaneously from the empirical averages outside and inside ϕ_i makes the stochastic gradients intrinsically biased (see Appendix A for a discussion).
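To see the bias concretely, consider a toy sketch (our own illustration, not from the paper) with a single convex merit function ϕ(u) = u²: replacing the inner empirical average by a single sample before applying ϕ yields an estimate whose mean differs from the true objective value, by Jensen's inequality.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=1000)          # toy inner data set
phi = lambda u: u ** 2             # a simple convex merit function

inner_mean = y.mean()
true_value = phi(inner_mean)       # phi applied to the full inner average
biased_value = phi(y).mean()       # average of phi applied to single samples

# Jensen's inequality: E[phi(y_j)] >= phi(E[y_j]) for convex phi,
# so the naive single-sample estimator is biased upward.
print(true_value, biased_value)
assert biased_value > true_value
```

The same mismatch carries over to the gradients, which is why plain SGD cannot be applied to (1) directly.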

To address these challenges, in this paper, we first reformulate the original problem (1) into an equivalent saddle point problem (i.e., min-max problem), which brings out all the empirical averages inside and exhibits useful dual decomposition and finite-sum structures (Section 3.1). To fully exploit these properties, we develop a stochastic primal-dual algorithm that alternates between a dual step of stochastic variance reduced coordinate ascent and a primal step of stochastic variance reduced gradient descent (Section 3.2). In particular, we develop a novel variance reduced stochastic gradient estimator for the primal step, which achieves better variance reduction with low complexity (Section 3.3). We derive the convergence rate, the finite-time complexity bound, and the storage complexity of our proposed algorithm (Section 4). In particular, it is shown that the proposed algorithms converge at a linear rate when the problem is strongly convex. Moreover, we also develop an approximate version of the algorithm that further reduces the storage complexity without much performance degradation in experiments. We evaluate the performance of our algorithms on several real-world benchmarks, where the experimental results show that they significantly outperform existing methods (Section 5). Finally, we discuss related works in Section 6 and conclude our paper in Section 7.

## 2 Motivation and Applications

To motivate our composition optimization problem (1), we discuss several important machine learning applications where cost functions of the form (1) arise naturally.

##### Unsupervised sequence classification:

Developing algorithms that can learn classifiers from unlabeled data could benefit many machine learning systems by saving a huge amount of human labeling cost. In [11, 21], the authors proposed such unsupervised learning algorithms by exploiting the sequential output structures. The developed algorithms are applied to optical character recognition (OCR) and automatic speech recognition (ASR) problems. In these works, the learning algorithms seek to learn a sequence classifier by optimizing the empirical output distribution match (Empirical-ODM) cost, which is of the following form (written in our notation):

 min_θ { − ∑_{i=0}^{n_X−1} p_LM(x_i) log( (1/n_Y) ∑_{j=0}^{n_Y−1} f_θ(x_i, y_j) ) },   (3)

where p_LM is a known language model (LM) that describes the distribution of the output sequence (e.g., x_i represents different N-grams), and f_θ(x_i, y_j) is a functional of the sequence classifier to be learned, with θ being its model parameter vector. The key idea is to learn the classifier so that its predicted output N-gram distribution is close to the prior N-gram distribution p_LM (see [11, 21] for more details). The cost function (3) can be viewed as a special case of (1) by setting ϕ_i(u) = −p_LM(x_i) log u, n_{Y_i} ≡ n_Y, y_{ij} ≡ y_j, and g(θ) = 0. Note that the formulation (2) cannot be directly used here, because of the dependency of the function f_θ on both x_i and y_j.

##### Risk-averse learning:

Another application where (1) arises naturally is the risk-averse learning problem, which is common in finance [15, 18, 8, 9, 19]. Let x_i be a vector consisting of the rewards from d assets at the i-th instance, where 0 ≤ i ≤ n − 1. The objective in risk-averse learning is to find the optimal weights of the assets so that the average return is maximized while the risk is minimized. It can be formulated as the following optimization problem:

 min_θ −(1/n) ∑_{i=0}^{n−1} ⟨x_i, θ⟩ + (1/n) ∑_{i=0}^{n−1} ( ⟨x_i, θ⟩ − (1/n) ∑_{j=0}^{n−1} ⟨x_j, θ⟩ )²,   (4)

where θ ∈ ℝ^d denotes the weight vector. The objective function in (4) seeks a tradeoff between the mean (the first term) and the variance (the second term). It can be understood as a special case of (2) (which is a further special case of (1)) by making the following identifications:

 n_X = n_Y = n,  y_i ≡ x_i,  f_θ(y_j) = [θ^T, −⟨y_j, θ⟩]^T,  ϕ_i(u) = (⟨x_i, u_{0:d−1}⟩ + u_d)² − ⟨x_i, u_{0:d−1}⟩,   (5)

where u_{0:d−1} denotes the subvector constructed from the first d elements of u, and u_d denotes its last element. An alternative yet simpler way of dealing with (4) is to treat the second term in (4) as a special case of (1) by setting

 n_X = n_{Y_i} = n,  y_{ij} ≡ x_j,  f_θ(x_i, y_{ij}) = ⟨x_i − y_{ij}, θ⟩,  ϕ_i(u) = u²,  u ∈ ℝ.   (6)

In addition, we observe that the first term in (4) is in standard empirical risk minimization form, which can be dealt with in a straightforward manner. This second formulation leads to algorithms with lower complexity, because the inner function f_θ is scalar-valued here, instead of (d+1)-dimensional as in the first formulation. Therefore, we adopt this formulation in our experiment section (Section 5).
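As a quick numerical check of this identification (our own sketch; the random data here are stand-ins for the reward vectors), the objective (4) computed directly matches the composition form built from (6):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 5
x = rng.normal(size=(n, d))        # toy reward vectors x_i
theta = rng.normal(size=d)         # portfolio weights

r = x @ theta                      # per-instance returns <x_i, theta>

# Objective (4): negative mean return plus variance of returns.
obj_direct = -r.mean() + ((r - r.mean()) ** 2).mean()

# Same objective via the identification (6):
# f_theta(x_i, y_ij) = <x_i - x_j, theta>, phi_i(u) = u^2.
inner = (r[:, None] - r[None, :]).mean(axis=1)   # (1/n) sum_j <x_i - x_j, theta>
obj_composed = -r.mean() + (inner ** 2).mean()

assert np.isclose(obj_direct, obj_composed)
```

The check confirms that the inner empirical average over j reproduces the centered return, so squaring it recovers the variance term of (4).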

##### Other applications:

Besides the above examples, cost functions of the form (1) appear in other applications such as reinforcement learning [5, 2, 3]. The readers are referred to [18] for a list of additional applications.

## 3 Algorithms

### 3.1 Saddle point formulation

Recall from (1) that there is an empirical average inside each (nonlinear) merit function ϕ_i, which prevents the direct application of stochastic gradient descent to (1) due to the inherent bias (see Appendix A for more discussion). Nevertheless, we will show that minimizing the original cost function (1) can be transformed into an equivalent saddle point problem, which brings out all the empirical averages inside ϕ_i. In what follows, we use the machinery of convex conjugate functions [14].

For a function ϕ(u), its convex conjugate function is defined as ϕ*(w) = sup_u (⟨u, w⟩ − ϕ(u)). Under certain mild conditions on ϕ [14], one can also express ϕ in terms of its conjugate function: ϕ(u) = sup_w (⟨u, w⟩ − ϕ*(w)). Let ϕ*_i denote the conjugate function of ϕ_i. Then, we can express ϕ_i as

 ϕ_i(u) = sup_{w_i ∈ ℝ^ℓ} ( ⟨u, w_i⟩ − ϕ*_i(w_i) ),   (7)

where w_i ∈ ℝ^ℓ is the corresponding dual variable. Substituting (7) into the original minimization problem (1), we obtain its equivalent min-max problem:

 min_θ max_w { L(θ, w) + g(θ) ≜ (1/n_X) ∑_{i=0}^{n_X−1} [ ⟨ (1/n_{Y_i}) ∑_{j=0}^{n_{Y_i}−1} f_θ(x_i, y_{ij}), w_i ⟩ − ϕ*_i(w_i) ] + g(θ) },   (8)

where w ≜ (w_0, …, w_{n_X−1}) is the collection of all dual variables. We note that the transformation of the original problem (1) into (8) brings out all the empirical averages that were inside ϕ_i. This new formulation allows us to develop the stochastic variance reduced algorithms below.
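As a sanity check on the conjugate machinery behind (7) (our own illustration), take ϕ(u) = u², the merit function of the risk-averse example: its conjugate is ϕ*(w) = sup_u (uw − u²) = w²/4, attained at u = w/2, and the supremum in (7) then recovers ϕ. A brute-force grid supremum agrees:

```python
import numpy as np

# For phi(u) = u^2, the conjugate is phi*(w) = sup_u (u*w - u^2) = w^2 / 4,
# with maximizer u = w/2.
u_grid = np.linspace(-10, 10, 200001)

def conjugate(w):
    # numerically approximate sup_u (u*w - u^2) on a fine grid
    return np.max(u_grid * w - u_grid ** 2)

for w in [-3.0, 0.5, 2.0]:
    assert np.isclose(conjugate(w), w ** 2 / 4, atol=1e-6)
```

Because ϕ*_i has such a simple closed form here, the argmin in the dual update (13) can likewise be solved in closed form.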

### 3.2 Stochastic variance reduced primal-dual algorithm

One common solution for the min-max problem (8) is to alternate between a minimization step (with respect to the primal variable θ) and a maximization step (with respect to the dual variable w). However, such an approach generally suffers from high computation complexity because each minimization/maximization step requires a summation over many components; in fact, each step requires a full pass over all the data samples. The complexity of such a batch algorithm would be prohibitively high when the number of data samples (i.e., n_X and n_{Y_i}) is large (e.g., larger than one million or even one billion in applications like unsupervised speech recognition [21]). On the other hand, problem (8) has rich structures that we can exploit to develop more efficient solutions.

To this end, we make the following observations. First, expression (8) implies that when θ is fixed, the maximization over the dual variable w decouples into n_X individual maximizations over the different w_i's. Second, the objective function in each individual maximization (with respect to w_i) contains a finite-sum structure over j. Third, by (8), for a fixed w, the minimization with respect to the primal variable θ is also performed over an objective function with a finite-sum structure. Based on these observations, we develop an efficient stochastic variance reduced primal-dual algorithm (named SVRPDA-I). It alternates between (i) a dual step of stochastic variance reduced coordinate ascent and (ii) a primal step of stochastic variance reduced gradient descent. The full algorithm is summarized in Algorithm 1, with its key ideas explained below.

##### Dual step: stochastic variance reduced coordinate ascent.

To exploit the decoupled dual maximization over the w_i's in (8), we can randomly sample an index i and update w_i according to:

 w_i^{(k)} = argmin_{w_i} { −⟨ (1/n_{Y_i}) ∑_{j=0}^{n_{Y_i}−1} f_{θ^{(k−1)}}(x_i, y_{ij}), w_i ⟩ + ϕ*_i(w_i) + (1/(2α_w)) ∥w_i − w_i^{(k−1)}∥² },   (9)

while keeping all other w_j's (j ≠ i) unchanged, where α_w denotes a step-size. Note that each step of recursion (9) still requires a summation over n_{Y_i} components. To further reduce the complexity, we approximate this sum by a variance reduced stochastic estimator defined in (12) (to be discussed in Section 3.3). The dual step of our algorithm is summarized in (13), where we assume that the function ϕ*_i is in a simple form so that the argmin can be solved in closed form. Note that we flip the sign of the objective function to change maximization into minimization and apply coordinate descent; we will still refer to the dual step as “coordinate ascent” (instead of descent).
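For instance (our own sketch, with arbitrary numbers standing in for the quantities in (9)), when ϕ*(w) = w²/4 — the conjugate of ϕ(u) = u² from the risk-averse example — the scalar dual argmin has a simple closed form, which a grid search confirms:

```python
import numpy as np

alpha_w, w_prev, delta = 0.3, 0.8, 1.7   # step size, previous dual iterate,
                                         # and a stand-in for the inner average

# Objective of the dual step (9) for phi*(w) = w^2 / 4:
obj = lambda w: -delta * w + w ** 2 / 4 + (w - w_prev) ** 2 / (2 * alpha_w)

# Stationarity: -delta + w/2 + (w - w_prev)/alpha_w = 0, which solves to
w_closed = (alpha_w * delta + w_prev) / (1 + alpha_w / 2)

# Verify against a brute-force minimization on a fine grid.
w_grid = np.linspace(-5, 5, 200001)
w_numeric = w_grid[np.argmin(obj(w_grid))]
assert abs(w_closed - w_numeric) < 1e-4
```

This is the kind of closed-form solvability of ϕ*_i that the dual step assumes.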

##### Primal step: stochastic variance reduced gradient descent

We now consider the minimization in (8) with respect to θ when w is fixed. The gradient descent step for the primal variable is given by

 θ^{(k)} = argmin_θ { ⟨ ∑_{i=0}^{n_X−1} ∑_{j=0}^{n_{Y_i}−1} (1/(n_X n_{Y_i})) f′_{θ^{(k−1)}}(x_i, y_{ij}) w_i^{(k)}, θ ⟩ + (1/(2α_θ)) ∥θ − θ^{(k−1)}∥² },   (10)

where α_θ denotes a step-size. It is easy to see that the update (10) has high complexity: it requires evaluating and averaging the gradient f′_θ at every data sample. To reduce the complexity, we use a variance reduced gradient estimator, defined in (15), to approximate the sums in (10) (to be discussed in Section 3.3). The primal step of our algorithm is summarized in (16) in Algorithm 1.
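Note that (10) is simply a gradient step written in proximal form; a minimal sketch (our own, with random stand-in values for the aggregated gradient term) verifies the closed-form solution:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
theta_prev = rng.normal(size=d)
G = rng.normal(size=d)             # stand-in for the aggregated gradient term in (10)
alpha_theta = 0.1

# argmin_theta { <G, theta> + ||theta - theta_prev||^2 / (2*alpha_theta) }
# has the closed form of a plain gradient step:
theta_closed = theta_prev - alpha_theta * G

# Check the stationarity condition G + (theta - theta_prev)/alpha_theta = 0.
assert np.allclose(G + (theta_closed - theta_prev) / alpha_theta, 0.0)
```

Replacing G by the cheap estimator (15) gives the actual primal step (16).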

### 3.3 Low-complexity stochastic variance reduced estimators

We now proceed to explain the design of the variance reduced gradient estimators in both the dual and the primal updates. The main idea is inspired by the stochastic variance reduced gradient (SVRG) algorithm [6]. Specifically, for a vector-valued function with a finite-sum form, h(θ) = (1/n) ∑_{i=0}^{n−1} h_i(θ), we can construct an SVRG estimator δ_k of h(θ) at each iteration step by using the following expression:

 δ_k = h_{i_k}(θ) − h_{i_k}(θ̃) + h(θ̃),   (17)

where i_k is a randomly sampled index from {0, …, n−1}, and θ̃ is a reference variable that is updated periodically (to be explained below). The first term in (17) is an unbiased estimator of h(θ) and is generally known as the stochastic gradient when h_i is the gradient of a certain cost function. The last two terms in (17) form a control variate that has zero mean and is negatively correlated with h_{i_k}(θ); this keeps δ_k unbiased while significantly reducing its variance. The reference variable θ̃ is usually set to be a delayed version of θ: for example, after every M updates of θ, it can be reset to the most recent iterate of θ. Note that there is a trade-off in the choice of M: a smaller M further reduces the variance of δ_k, since θ̃ will be closer to θ and the first two terms in (17) cancel each other more; on the other hand, it also requires more frequent evaluations of the costly batch term h(θ̃), which has O(n) complexity.
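The two key properties of (17) can be verified numerically. Below is our own illustration with a toy quadratic choice of h_i (not from the paper): the estimator is exactly unbiased for the batch average, and its variance shrinks when the reference variable is close to the current iterate.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 3
A = rng.normal(size=(n, d))

def h_i(i, theta):                 # component function: per-sample gradient
    return A[i] * (A[i] @ theta)   # of 0.5 * (a_i . theta)^2

def h_bar(theta):                  # full batch average h(theta)
    return A.T @ (A @ theta) / n

theta = rng.normal(size=d)
theta_ref = theta + 0.01 * rng.normal(size=d)   # reference close to theta
h_ref = h_bar(theta_ref)                        # batch term, computed once

# SVRG estimator (17), enumerated over all possible sampled indices i_k:
deltas = np.array([h_i(i, theta) - h_i(i, theta_ref) + h_ref for i in range(n)])
plain = np.array([h_i(i, theta) for i in range(n)])

assert np.allclose(deltas.mean(axis=0), h_bar(theta))      # unbiased
assert deltas.var(axis=0).sum() < plain.var(axis=0).sum()  # reduced variance
```

As θ̃ drifts away from θ, the control variate cancels less and the variance of δ_k grows, which is exactly the trade-off governed by M.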

Based on (17), we develop two stochastic variance reduced estimators, (12) and (15), to approximate the finite sums in (9) and (10), respectively. The dual gradient estimator in (12) is constructed in the standard manner using (17), where the reference variable θ̃ is a delayed version of θ (as in [6], we also consider Option II, wherein θ̃ is randomly chosen from the previous iterates of θ). On the other hand, the primal gradient estimator in (15) uses the most recent w^{(k)} as the dual reference variable, without any delay. As discussed earlier, such a choice leads to a smaller variance in the stochastic estimator at a potentially higher computation cost (from more frequent evaluation of the batch term). Nevertheless, we are able to show that, with the dual coordinate ascent structure in our algorithm, the batch term in (15), which is the summation in (10) evaluated at the reference variables, can be computed efficiently. To see this, note that after each dual update step in (13), only one term inside this summation in (10) has changed: the one associated with the sampled index. Therefore, we can correct for this term using recursion (14), which only requires O(1) extra complexity per step (the same order as (15)).

Note that SVRPDA-I (Algorithm 1) requires computing and storing all the batch quantities in (11), which could be expensive in storage for some applications. To reduce this cost, we develop a variant of Algorithm 1, named SVRPDA-II (see Algorithm 2 in the supplementary material), which approximates the batch quantity in (14) by a stochastic estimate based on another randomly sampled index, drawn independently of all other indexes. By doing this, we significantly reduce the memory requirement relative to SVRPDA-I (see Section 4.2). In addition, experimental results in Section 5 show that this approximation causes only a slight performance loss compared to SVRPDA-I.

## 4 Theoretical Analysis

### 4.1 Convergence and computation complexity

We now provide a convergence analysis of the SVRPDA-I algorithm and study its computation and storage complexities. To begin with, we introduce the following assumptions.

###### Assumption 4.1.

The function g(θ) is μ-strongly convex in θ, and each merit function ϕ_i is (1/γ)-smooth (equivalently, each conjugate ϕ*_i is γ-strongly convex).

###### Assumption 4.2.

The merit functions ϕ_i are Lipschitz with a uniform constant B_w:

 |ϕ_i(u) − ϕ_i(u′)| ≤ B_w ∥u − u′∥,  ∀u, u′;  ∀i = 0, …, n_X − 1.
###### Assumption 4.3.

f_θ(x_i, y_{ij}) is B_θ-smooth in θ and has gradients bounded by B_f:

 ∥f′_{θ_1}(x_i, y_{ij}) − f′_{θ_2}(x_i, y_{ij})∥ ≤ B_θ ∥θ_1 − θ_2∥,  ∥f′_θ(x_i, y_{ij})∥ ≤ B_f,  ∀θ, θ_1, θ_2, ∀i, j.
###### Assumption 4.4.

For each w in its domain, the function L(θ, w) defined in (8) is convex in θ:

 L(θ_1, w) − L(θ_2, w) ≥ ⟨L′_θ(θ_2, w), θ_1 − θ_2⟩,  ∀θ_1, θ_2.

Based on the above assumptions, we establish the non-asymptotic error bounds for SVRPDA-I (using either Option I or Option II in Algorithm 1). The main results are summarized in the following theorems, and their proofs can be found in Appendix D.

###### Theorem 4.5.

Suppose Assumptions 4.1–4.4 hold. If in Algorithm 1 (with Option I) we choose

 α_θ = μ^{−1} / ( 64 n_X ( B_f²/(μγ) + B_θ²B_w²/μ² ) + n_X ),  α_w = (n_X μ/γ) α_θ,  M = ⌈ 78.8 n_X ( B_f²/(μγ) + B_θ²B_w²/μ² ) + 1.3 n_X + 1.3 ⌉,

where ⌈·⌉ denotes the roundup operation, then the Lyapunov function defined in Appendix D converges linearly: each outer iteration contracts it by a constant factor less than one. Furthermore, the overall computational cost (in number of oracle calls; one oracle call is defined as querying f_θ(x_i, y_{ij}), its gradient f′_θ(x_i, y_{ij}), or ϕ*_i for any i and j) for reaching an ϵ-accurate solution is upper bounded by

 O( (n_X n_Y + n_X κ + n_X) ln(1/ϵ) ),   (18)

where, with a slight abuse of notation, the condition number is defined as κ ≜ B_f²/(μγ) + B_θ²B_w²/μ².

###### Theorem 4.6.

Suppose Assumptions 4.1–4.4 hold. If in Algorithm 1 (with Option II) we choose

 α_θ = ( 25 B_f²/γ + 10 B_θB_w + 80 B_w²B_θ²/μ )^{−1},  α_w = μ/(40 B_f²),  M = max( 10/(α_θ μ), 2n_X/(α_w γ), 4n_X ),

then the Lyapunov function defined in Appendix D also converges linearly. Furthermore, with κ defined as in Theorem 4.5, the overall computational cost (in number of oracle calls) for reaching an ϵ-accurate solution is upper bounded by

 O( (n_X n_Y + n_X κ + n_X) ln(1/ϵ) ).   (19)

The above theorems show that the Lyapunov functions for SVRPDA-I converge to zero at a linear rate when either Option I or Option II is used. Since the Lyapunov functions dominate the optimality gap of the iterates, the theorems also imply that the computational cost (in number of oracle calls) for reaching an ϵ-accurate solution is upper bounded by (18) and (19).

### 4.2 Storage complexity

We now briefly discuss and compare the storage complexities of SVRPDA-I and SVRPDA-II. In Table 1, we report the itemized and total storage complexities of both algorithms, which shows that SVRPDA-II significantly reduces the memory footprint. We also observe that the batch quantities in (11) dominate the storage complexity of SVRPDA-I, whereas the memory usage of SVRPDA-II is more uniformly distributed over the different quantities. Furthermore, although the total complexity of SVRPDA-II grows with the number of samples n_X, the associated term is relatively small because the dual dimension ℓ is small in many practical problems (e.g., ℓ = 1 in (3) and (4)). This is similar to the storage requirement in SPDC [22] and SAGA [4].

### 4.3 Special cases

To the best of our knowledge, the general objective function (1) has not been considered before in the literature, so it is useful to examine some special cases and compare our bounds to existing ones. Specifically, we consider two special cases in this subsection.

First, we consider the case of n_X = 1, which simplifies the objective function (1) so that there is no empirical average outside ϕ_i. This takes the form of the unsupervised learning objective function that appears in [12]. As an immediate corollary, Theorems 4.5–4.6 imply that the overall complexity of SVRPDA-I (with Option I or II) is

 O( (n_Y + κ) ln(1/ϵ) ).   (20)

Note that our result enjoys a linear convergence rate (i.e., logarithmic dependency on 1/ϵ) due to the variance reduction technique. In contrast, the stochastic primal-dual gradient (SPDG) method in [12], which does not use variance reduction, only has a sublinear convergence rate.

Second, we consider the case where n_{Y_i} = 1 for all i and f_θ is a linear function of θ. This simplifies (1) to the problem considered in [22] (and references therein), known as regularized empirical risk minimization with linear predictors. This problem formulation has applications in support vector machines, regularized logistic regression, and more, depending on how the merit function ϕ_i is defined.

In this special case, our SVRPDA-I algorithm becomes a single-loop algorithm: the outer loop in Algorithm 1 is no longer needed. To see this, first note that when n_{Y_i} = 1, the dual estimator in (12) is independent of the reference variable, because its last two terms cancel each other. Second, when f_θ is linear in θ, the gradient f′_θ in (14) and in (11) is independent of θ, which further implies that the quantity recursively defined in (14) is also independent of the reference variable. Finally, with linear f_θ, the first two terms in (15) cancel each other, so that the primal estimator is also independent of the reference variable. As a result, the inner loop of Algorithm 1 does not require an outer loop to update the reference variable.

The following theorem establishes the complexity bound for the SVRPDA-I algorithm in the second special case (see Appendix E for the proof).

###### Theorem 4.7.

Suppose Assumptions 4.1–4.4 hold. Furthermore, suppose n_{Y_i} = 1 for all i and f_θ is a linear function of θ. Consider just the inner loop of Algorithm 1, with s = 1 fixed, and

 α_θ = γ / ( 16 B_f² + 4 n_X μγ ),  and  α_w = (1/(2γ)) · (3n_X + κ + 1)/(n_X + κ + 1),

where κ ≜ B_f²/(μγ) is the condition number. Then, the Lyapunov function

 Δ^{(k)} := ( 1/(2α_θ) + μ ) E{ ∥θ^{(k)} − θ*∥² | F_k } + ( 1/(2α_w) + γ ) E{ ∥w^{(k)} − w*∥² | F_k }

decays geometrically in k (see Appendix E). Furthermore, the overall computational cost (in number of oracle calls) for reaching an ϵ-accurate solution is upper bounded by

 O( (n_X + κ) ln(1/ϵ) ).   (21)

In comparison, the authors of [22] propose a stochastic primal-dual coordinate (SPDC) algorithm for this special case and prove an overall complexity of O((n_X + √(n_X κ)) ln(1/ϵ)) for reaching an ϵ-error solution, which is by far the best complexity for this class of problems. It is interesting to note that the complexity result in (21) and that of [22] differ only in their dependency on the condition number κ. This difference is most likely due to the acceleration technique employed in the primal update of the SPDC algorithm. We conjecture that the dependency of SVRPDA-I on the condition number can be further improved using a similar acceleration technique.

## 5 Experiments

In this section, we consider the problem of risk-averse learning for portfolio management optimization [8, 9], introduced in Section 2. Specifically, we want to solve the optimization problem (4) for a given set of reward vectors {x_i}. As discussed in Section 2, we adopt the alternative formulation (6) for the second term so that it becomes a special case of our general problem (1). We then rewrite the cost function as a min-max problem by following the argument in Section 3.1 and apply SVRPDA-I and SVRPDA-II (see Appendix C.1 for details).

We evaluate our algorithms on 18 real-world US Research Returns datasets obtained from the Center for Research in Security Prices (CRSP) website (the processed data in the form of a .mat file was obtained from https://github.com/tyDLin/SCVRG), with the same setup as in [9]. We compare the performance of our proposed SVRPDA-I and SVRPDA-II algorithms (the choice of hyper-parameters can be found in Appendix C.2, and the code will be released publicly) with the following state-of-the-art algorithms designed to solve composition optimization problems: (i) Compositional-SVRG-1 [8], (ii) Compositional-SVRG-2 [8], (iii) full-batch gradient descent, and (iv) the ASCVRG algorithm [9]. For the compositional-SVRG algorithms, we follow [8] to formulate the problem as a special case of the form (2) by using the identification (5). Note that we cannot use the identification (6) for the compositional-SVRG algorithms, because it would lead to the more general formulation (1) with f_θ depending on both x_i and y_{ij}. For further details, the reader is referred to [8].

As in previous works, we compare the different algorithms based on the number of oracle calls required to achieve a certain objective gap (the difference between the objective function evaluated at the current iterate and at the optimal parameters). One oracle call is defined as accessing the function f_θ(x_i, y_{ij}) or its derivative f′_θ(x_i, y_{ij}) for any i and j. The results are shown in Figure 1: our proposed algorithms significantly outperform the baseline methods on all datasets. In addition, we observe that SVRPDA-II also converges at a linear rate, and that the performance loss caused by its approximation is small relative to SVRPDA-I.

## 6 Related Works

Composition optimization has attracted significant attention in the optimization literature. The stochastic version of problem (2), where the empirical averages are replaced by expectations, is studied in [18]. The authors propose a two-timescale stochastic approximation algorithm known as SCGD and establish sublinear convergence rates. In [19], the authors propose the ASC-PG algorithm, which uses a proximal gradient method to deal with nonsmooth regularizations. The works most closely related to our setting are [8] and [9], which consider the finite-sum minimization problem (2) (a special case of our general formulation (1)). In [8], the authors propose the compositional-SVRG methods, which combine SCGD with the SVRG technique from [6] and obtain linear convergence rates. In [9], the authors propose the ASCVRG algorithms, which extend to convex but nonsmooth objectives.

Different from all these works, we take a primal-dual approach that fully exploits the dual decomposition and the finite-sum structures to solve the problem efficiently.

On the other hand, problems similar to (1) (and its stochastic versions) have also been examined in specific machine learning settings. [16] considers the minimization of the mean squared projected Bellman error (MSPBE) for policy evaluation, which has an expectation inside a quadratic loss. The authors propose a two-timescale stochastic approximation algorithm, GTD2, and establish its asymptotic convergence. [10] and [13] independently showed that GTD2 is a stochastic gradient method for solving an equivalent saddle-point problem. In [2] and [3], the authors derive saddle-point formulations for two other variants of the cost (MSBE and MSCBE), in the policy evaluation and the control settings, and develop stochastic primal-dual algorithms. All these works consider the stochastic version of composition optimization, and the proposed algorithms have sublinear convergence rates. In [5], different variance reduction methods are developed to solve the finite-sum version of MSPBE and achieve a linear rate. In addition, problems of the form (1) have been studied in the context of unsupervised learning [11, 21] in the stochastic setting (with expectations in (1)).

Finally, our work is inspired by the stochastic variance reduction techniques in optimization [7, 6, 4, 1, 22], which consider the minimization of a cost that is a finite sum of many component functions. Different versions of variance reduced stochastic gradients are constructed in these works to achieve linear convergence rates. In particular, our variance reduced stochastic estimators are constructed based on the idea of SVRG [6], with a novel design of the control variates. Our work is also related to the SPDC algorithm [22], which likewise integrates dual coordinate ascent with variance reduced primal gradients. However, our work differs from SPDC in the following aspects. First, we consider the more general composition optimization problem (1), while SPDC focuses on regularized empirical risk minimization with linear predictors, i.e., n_{Y_i} = 1 and f_θ linear in θ. Second, because of the composition structure in the problem, our algorithms also need SVRG in the dual coordinate ascent update, while SPDC does not. Third, the primal update in SPDC is specifically designed for linear predictors, whereas our novel variance reduced gradient removes that restriction.

## 7 Conclusions and Future Work

We developed a stochastic primal-dual algorithm, SVRPDA-I, to efficiently solve the empirical composition optimization problem. This is achieved by fully exploiting the rich structures inherent in the reformulated min-max problem, including the dual decomposition and the finite-sum structures. The algorithm alternates between (i) a dual step of stochastic variance reduced coordinate ascent and (ii) a primal step of stochastic variance reduced gradient descent. In particular, we proposed a novel variance reduced gradient for the primal update, which achieves better variance reduction with low complexity. We derived a non-asymptotic bound and showed that the algorithm converges at a linear rate when the problem is strongly convex. Moreover, we developed an approximate version of the algorithm, SVRPDA-II, which further reduces the storage complexity. Experimental results on several real-world benchmarks showed that both SVRPDA-I and SVRPDA-II significantly outperform existing techniques, and that the approximation in SVRPDA-II causes only a slight performance loss. Future extensions of our work include the theoretical analysis of SVRPDA-II, the generalization of our algorithms to Bregman divergences, and applications to large-scale machine learning problems with non-convex cost functions (e.g., unsupervised sequence classification).

## References

• [1] P. Balamurugan and F. Bach. Stochastic variance reduction methods for saddle-point problems. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 1416–1424, 2016.
• [2] B. Dai, N. He, Y. Pan, B. Boots, and L. Song. Learning from conditional distributions via dual embeddings. In Artificial Intelligence and Statistics, pages 1458–1467, 2017.
• [3] B. Dai, A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song. SBEED: Convergent reinforcement learning with nonlinear function approximation. In Proc. International Conference on Machine Learning (ICML), pages 1133–1142, 2018.
• [4] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Proc. Advances in neural information processing systems (NIPS), pages 1646–1654, 2014.
• [5] S. S. Du, J. Chen, L. Li, L. Xiao, and D. Zhou. Stochastic variance reduction methods for policy evaluation. In Proc. International Conference on Machine Learning (ICML), pages 1049–1058, 2017.
• [6] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Proc. Advances in neural information processing systems, pages 315–323, 2013.
• [7] N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 2672–2680, 2012.
• [8] X. Lian, M. Wang, and J. Liu. Finite-sum composition optimization via variance reduced gradient descent. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1159–1167, 2017.
• [9] T. Lin, C. Fan, M. Wang, and M. I. Jordan. Improved oracle complexity for stochastic compositional variance reduced gradient. arXiv preprint arXiv:1806.00458, 2018.
• [10] B. Liu, J. Liu, M. Ghavamzadeh, S. Mahadevan, and M. Petrik. Finite-sample analysis of proximal gradient td algorithms. In Proc. Conference on Uncertainty in Artificial Intelligence, pages 504–513, 2015.
• [11] Y. Liu, J. Chen, and L. Deng. Unsupervised sequence classification using sequential output statistics. In Proc. Advances in Neural Information Processing Systems, pages 3550–3559, 2017.
• [13] S. V. Macua, J. Chen, S. Zazo, and A. H. Sayed. Distributed policy evaluation under multiple behavior strategies. IEEE Transactions on Automatic Control, 60(5):1260–1274, 2015.
• [14] R. T. Rockafellar. Convex analysis. Princeton university press, 2015.
• [15] A. Ruszczyński and A. Shapiro. Optimization of risk measures. In Probabilistic and randomized methods for design under uncertainty, pages 119–157. Springer, 2006.
• [16] R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proc. International Conference on Machine Learning (ICML), pages 993–1000. ACM, 2009.
• [17] V. Vapnik. Statistical Learning Theory, volume 3. Wiley, New York, 1998.
• [18] M. Wang, E. X. Fang, and H. Liu. Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 161(1-2):419–449, 2017.
• [19] M. Wang, J. Liu, and E. Fang. Accelerating stochastic composition optimization. In Proc. Advances in Neural Information Processing Systems, pages 1714–1722, 2016.
• [20] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
• [21] C.-K. Yeh, J. Chen, C. Yu, and D. Yu. Unsupervised speech recognition via segmental empirical output distribution matching. In Proc. International Conference on Learning Representations, 2019.
• [22] Y. Zhang and L. Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. Journal of Machine Learning Research, 18(1):2939–2980, 2017.


## Appendix A Solving (1) in the main paper directly by SGD is biased

Applying the standard chain rule, we obtain the gradient of the cost function in (1) as

$$\frac{1}{n_X}\sum_{i=0}^{n_X-1}\phi_i'\big(\bar{f}_i(\theta)\big)\,\bar{f}_i'(\theta) \tag{22}$$

where:

$$\bar{f}_i(\theta) \triangleq \frac{1}{n_{Y_i}}\sum_{j=0}^{n_{Y_i}-1} f_\theta(x_i, y_{ij}), \qquad \bar{f}_i'(\theta) \triangleq \frac{1}{n_{Y_i}}\sum_{j=0}^{n_{Y_i}-1} f_\theta'(x_i, y_{ij}) \tag{23}$$

and $f_\theta'(x,y)$ denotes the Jacobian matrix, with its $(i,j)$-th element defined as:

$$\big[f_\theta'(x,y)\big]_{i,j} = \frac{\partial}{\partial \theta_i}\big[f_\theta(x,y)\big]_j \tag{24}$$

Note from (22) that there are empirical averages both inside and outside the nonlinear functions $\phi_i'(\cdot)$. Therefore, if we sample these empirical averages simultaneously, the resulting stochastic gradient estimator is biased. In other words, a direct application of stochastic gradient descent to (1) is intrinsically biased.
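The bias can be verified numerically. Below is a minimal sketch (not from the paper) using the hypothetical choices $\phi(u) = u^2$ and a scalar linear $f_\theta(y) = \theta y$: the true gradient is $2(\theta\bar{y})\bar{y}$, whereas the naive estimator plugs a single sample $y_j$ into both the inner value and the inner derivative, so its expectation is $2\theta\,\mathbb{E}[y_j^2]$, which differs from the true gradient by $2\theta\,\mathrm{Var}(y_j)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Inner data: n_Y samples; phi(u) = u^2, f_theta(y) = theta * y.
# Objective: F(theta) = phi( (1/n_Y) * sum_j f_theta(y_j) ) = (theta * y_bar)^2
y = rng.normal(loc=1.0, scale=2.0, size=10_000)
theta = 0.7
y_bar = y.mean()

# True gradient by the chain rule: phi'(f_bar(theta)) * f_bar'(theta)
true_grad = 2.0 * (theta * y_bar) * y_bar

# Naive "SGD" estimator: draw ONE sample y_j and use it for BOTH the
# inner empirical average and its derivative.
samples = rng.choice(y, size=200_000)
naive_grads = 2.0 * (theta * samples) * samples
naive_mean = naive_grads.mean()

# The gap is approximately 2 * theta * Var(y): the estimator is biased
# whenever the inner empirical average has nonzero variance.
print(f"true gradient       : {true_grad:.4f}")
print(f"naive estimator mean: {naive_mean:.4f}")
print(f"bias                : {naive_mean - true_grad:.4f}")
```

With these parameters the bias is roughly $2\theta\,\mathrm{Var}(y) \approx 5.6$, far from zero, illustrating why (1) requires the primal-dual reformulation rather than plain SGD.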

## Appendix B SVRPDA-II Algorithm

Algorithm 2 in this supplementary material summarizes the full details of the SVRPDA-II algorithm, which was developed in Section 3.3 of the main paper. Note that it no longer requires the computation or the storage of the quantity in (25). Also note that the term in (28) is now replaced with its approximation.

## Appendix C Experiment details

### c.1 Implementation details in risk-averse learning

As we discussed in Section 2, we adopt the alternative formulation (6) for the second term so that it becomes a special case of our general problem (1). Then, using the argument in Section 3.1, the second term in (4) can be rewritten into the objective in (8). Combining it with the first term in (4), the original problem (4) can be reformulated into the following equivalent min-max form:

$$\min_{\theta\in\mathbb{R}^d}\,\max_{w}\ \frac{1}{n}\sum_{i=0}^{n-1}\Big(\Big\langle \frac{1}{n}\sum_{j=0}^{n-1}\langle x_i - x_j,\, \theta\rangle,\ w_i\Big\rangle\ -$$