# Biased Stochastic Gradient Descent for Conditional Stochastic Optimization

## Abstract

Conditional Stochastic Optimization (CSO) covers a variety of applications ranging from meta-learning and causal inference to invariant learning. However, constructing unbiased gradient estimates in CSO is challenging due to the composition structure. As an alternative, we propose a biased stochastic gradient descent (BSGD) algorithm and study the bias-variance tradeoff under different structural assumptions. We establish the sample complexities of BSGD for strongly convex, convex, and weakly convex objectives, under smooth and non-smooth conditions. We also provide matching lower bounds of BSGD for convex CSO objectives. Extensive numerical experiments are conducted to illustrate the performance of BSGD on robust logistic regression, model-agnostic meta-learning (MAML), and instrumental variable regression (IV).

## 1 Introduction

We study a class of optimization problems, called Conditional Stochastic Optimization (CSO), in the form of

$$\min_{x\in\mathcal{X}} F(x) := \mathbb{E}_{\xi}\, f_{\xi}\big(\mathbb{E}_{\eta|\xi}\, g_{\eta}(x,\xi)\big), \tag{1}$$

where $g_\eta(x,\xi)$ is a vector-valued function dependent on both random vectors $\xi$ and $\eta$, $f_\xi$ is a continuous function dependent on the random vector $\xi$, and the inner expectation is taken with respect to the conditional distribution of $\eta$ given $\xi$. Throughout, we only assume access to samples from the distribution of $\xi$ and the conditional distribution of $\eta$ given $\xi$, rather than the cumulative distribution functions of these distributions.

CSO includes classical stochastic optimization as a special case when $f_\xi$ is the identity function, but is much more general. It has recently been utilized in a variety of applications in machine learning, ranging from policy evaluation and control in reinforcement learning (Dai et al., 2017, 2018; Nachum and Dai, 2020) and optimal control in linearly-solvable Markov decision processes (Dai et al., 2017), to instrumental variable regression in causal inference (Singh et al., 2019; Muandet et al., 2019).

One common challenge with these aforementioned applications is that, in the extreme case, only a few or even a single sample is available from the conditional distribution of $\eta$ for each given $\xi$. To deal with this limitation, a primal-dual stochastic approximation algorithm was proposed to solve a min-max reformulation of CSO using kernel embedding techniques (Dai et al., 2017). However, this approach requires convexity of $f_\xi$ and linearity of $g_\eta$, which are not satisfied in general applications, especially when neural networks are used.

On the other hand, for many other applications, e.g., those in invariant learning and meta-learning, we do have access to multiple samples from the conditional distribution. Take the model-agnostic meta-learning (MAML) (Finn et al., 2017) as an example. It belongs to a broader concept called few-shot meta-learning, which uses models obtained from previous tasks on new tasks with possibly only a small amount of data (Thrun and Pratt, 2012). MAML itself aims to learn a meta-initialization parameter using metadata for similar learning tasks that are trained via gradient-based algorithms (Finn et al., 2017). It can be formulated into the following optimization problem:

$$\min_{w}\ \mathbb{E}_{i\sim p,\, D_i^{\text{query}}}\ l_i\Big(\mathbb{E}_{D_i^{\text{support}}}\big(w - \alpha\nabla l_i(w, D_i^{\text{support}})\big),\ D_i^{\text{query}}\Big), \tag{2}$$

where $p$ represents the distribution over different tasks, $D_i^{\text{support}}$ and $D_i^{\text{query}}$ correspond to the support (training) data and query (testing) data of the task $i$ to be trained on, $l_i$ is the loss function on data from task $i$, and $\alpha$ is a fixed meta step size. Setting $\xi = (i, D_i^{\text{query}})$ and $\eta = D_i^{\text{support}}$, (2) is clearly a special case of CSO for which multiple samples can be drawn from the conditional distribution of $\eta$ given $\xi$. Since the loss function generally involves neural networks, the resulting CSO is usually a nonconvex problem. Thus, the previous primal-dual algorithm and dual embedding techniques developed in Dai et al. (2017) simply do not apply.
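To make the reduction concrete, the MAML objective (2) can be instantiated on a toy problem in which each task is a one-dimensional quadratic. The following sketch (the task distribution, noise scales, and meta step size are our own illustrative choices, not the paper's experimental setup) evaluates a Monte-Carlo estimate of (2):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instantiation of (2): each task i is a 1-D quadratic with center c_i,
# l_i(w, D) = 0.5 * (w - mean(D))^2, and data batches are noisy draws of c_i.
alpha = 0.1                    # fixed meta step size
centers = rng.normal(size=20)  # "task distribution": one center per task

def grad(w, batch):
    # gradient of the task loss 0.5 * (w - mean(batch))^2
    return w - batch.mean()

def maml_objective(w, n_support=5, n_query=5):
    """Monte-Carlo estimate of (2): average over tasks of the query loss at
    the adapted parameter w - alpha * grad on a support batch."""
    total = 0.0
    for c in centers:
        support = c + rng.normal(scale=0.5, size=n_support)  # D_i^support
        query = c + rng.normal(scale=0.5, size=n_query)      # D_i^query
        w_adapted = w - alpha * grad(w, support)             # inner step
        total += 0.5 * (w_adapted - query.mean()) ** 2       # outer loss
    return total / len(centers)

print(maml_objective(0.0))
```

Here the inner expectation over $D_i^{\text{support}}$ is approximated by a single sample average, which is exactly the source of the bias that BSGD controls through the inner mini-batch size.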

In this paper, we focus on the general CSO problem where multiple samples from the conditional distribution are available, and the objective is not necessarily in the compositional form of a convex loss $f_\xi$ and a linear mapping $g_\eta$. A closely related work is Hu et al. (2019), in which the authors study the generalization error bound and sample complexity of empirical risk minimization (ERM), a.k.a. sample average approximation (SAA), for general CSO. In contrast, here we aim to develop efficient stochastic gradient-based methods that directly solve the CSO problem (1), in both convex and nonconvex settings.

Due to the composition structure of the CSO problem (1), constructing unbiased gradient estimators with controlled variance is not possible in general. Instead, we introduce a simple biased stochastic gradient descent (BSGD) method that leverages a mini-batch of conditional samples to construct a gradient estimator with controllable bias. We establish the sample complexities of BSGD for strongly convex, convex, and weakly convex objectives. For convex problems, this refers to the number of samples required to find an $\epsilon$-optimal solution in expectation; for nonconvex problems, it refers to the number of samples required to find an $\epsilon$-stationary point in expectation. Some of our results are summarized in Table 1, with a comparison to the sample complexity of SAA (or ERM) established in Hu et al. (2019).

### 1.1 Our Contributions

Our contributions are two-fold:

First, we show that for general convex CSO problems, BSGD attains the same total sample complexity as empirical risk minimization, and that the sample complexity reduces significantly when either smoothness or strong convexity holds. More specifically, for convex and Lipschitz continuous objectives, BSGD requires at most $\mathcal{O}(\epsilon^{-4})$ total samples to find an $\epsilon$-optimal solution. If, in addition, the outer random function $f_\xi$ is smooth or the objective is strongly convex, the sample complexity reduces from $\mathcal{O}(\epsilon^{-4})$ to $\tilde{\mathcal{O}}(\epsilon^{-3})$. This improves to $\tilde{\mathcal{O}}(\epsilon^{-2})$ if both conditions hold, where $\tilde{\mathcal{O}}(\cdot)$ represents the rate hiding logarithmic factors. In contrast to the SAA results in Hu et al. (2019), these sample complexities are independent of the problem's dimension. Furthermore, we show that, for weakly convex CSO problems (which are not necessarily smooth nor convex), BSGD requires a total sample complexity of $\mathcal{O}(\epsilon^{-8})$ to achieve an $\epsilon$-stationary point. When $f_\xi$ is smooth, the total sample complexity reduces to $\mathcal{O}(\epsilon^{-6})$. To the best of our knowledge, these are the first sample complexity results for stochastic oracle-based algorithms solving general CSO. In addition, for general convex CSO, we demonstrate that the complexity bounds of BSGD are indeed tight from an information-theoretic lower-bound perspective.

Second, BSGD serves as a viable alternative for solving MAML and invariant learning, with theoretical guarantees. Particularly for MAML, BSGD converges to a stationary point under simple deterministic stepsize rules and appropriate inner mini-batch sizes. In contrast, the commonly used first-order MAML algorithm (Finn et al., 2017), which ignores the Hessian information, is not guaranteed to converge even for large inner mini-batch sizes. For smooth MAML, our algorithm attains the same sample complexity as the recent algorithm introduced in Fallah et al. (2019); however, the latter requires stochastic stepsizes and mini-batches of outer samples at each iteration, which is less practical. Numerically, we further demonstrate that BSGD and its Adam variant achieve state-of-the-art performance with properly tuned hyperparameters on some MAML tasks. We also conduct numerical experiments on several other applications in invariant logistic regression and instrumental variable regression, and illustrate the delicate bias-variance tradeoff between the inner mini-batch size and the number of iterations in practice, which further supports our theoretical findings.

### 1.2 Related Work

Stochastic optimization. Traditional stochastic optimization, $\min_{x\in\mathcal{X}} \mathbb{E}_\xi f_\xi(x)$, is a special case of CSO. SGD and its numerous variants form one of the most important families of algorithms for solving stochastic optimization. For upper and lower bounds on stochastic gradient-based methods for convex and strongly convex objectives, see the seminal works and references therein (Nemirovsky and Yudin, 1985; Nemirovski et al., 2009). For weakly convex objectives, SGD achieves stationary convergence (Ghadimi and Lan, 2013; Zhang and He, 2018; Davis and Drusvyatskiy, 2019). Recently, Drori and Shamir (2019) and Arjevani et al. (2019) showed an $\Omega(\epsilon^{-4})$ lower bound on stationary convergence for SGD and for unbiased stochastic first-order algorithms, respectively.

Stochastic Composition Optimization. This class of problems deals with objectives in the form of:

$$\min_{x\in\mathcal{X}} f\circ g(x) := \mathbb{E}_{\xi}\big[f_{\xi}\big(\mathbb{E}_{\eta}[g_{\eta}(x)]\big)\big], \tag{3}$$

where $f_\xi: \mathbb{R}^k \to \mathbb{R}$ and $g_\eta: \mathbb{R}^d \to \mathbb{R}^k$. Several stochastic first-order algorithms have been developed to solve (3) as well as its corresponding ERM; see, e.g., Wang et al. (2016, 2017); Lian et al. (2017); Ghadimi et al. (2018); Zhang and Xiao (2019a, b, c), to name a few. A key difference between general CSO and problem (3) is that the inner randomness $\eta$ in (3) is independent of the outer randomness $\xi$, so the latter can be formulated as a composition of two deterministic functions while (1) cannot. As a result, the aforementioned algorithms, which hinge upon this special structure, are not directly applicable to solving CSO.

### 1.3 Notations

Throughout the paper, we use $\Pi_{\mathcal{X}}$ to denote the projection operator onto $\mathcal{X}$, i.e., $\Pi_{\mathcal{X}}(y) := \arg\min_{x\in\mathcal{X}} \|x - y\|_2^2$. We use $\tilde{\mathcal{O}}(\cdot)$ to represent the rate hiding logarithmic factors. A function $f$ is $L$-Lipschitz continuous on $\mathcal{X}$ if $|f(x) - f(y)| \le L\|x - y\|_2$ holds for any $x, y \in \mathcal{X}$. A function $f$ is $S$-Lipschitz smooth on $\mathcal{X}$ if $\|\nabla f(x) - \nabla f(y)\|_2 \le S\|x - y\|_2$ holds for any $x, y \in \mathcal{X}$. We say a function $f$ is $\mu$-convex on $\mathcal{X}$ if for any $x, y \in \mathcal{X}$,

$$f(x) - f(y) - \nabla f(y)^{\top}(x - y) \ge \frac{\mu}{2}\|x - y\|_2^2. \tag{4}$$

Note that $\mu > 0$, $\mu = 0$, and $\mu < 0$ correspond to $f$ being strongly convex, convex, and weakly convex, respectively.

## 2 Biased Stochastic Gradient Descent

For simplicity, throughout, we assume that the random functions $f_\xi$ and $g_\eta$ are continuously differentiable. Based on the special composition structure of CSO and the chain rule, the gradient of $F$ in (1) is given by

$$\nabla F(x) = \mathbb{E}_{\xi}\Big[\big(\mathbb{E}_{\eta|\xi}\nabla g_{\eta}(x,\xi)\big)^{\top}\, \nabla f_{\xi}\big(\mathbb{E}_{\eta|\xi}\, g_{\eta}(x,\xi)\big)\Big].$$

In general, it is costly and even impossible to construct an unbiased stochastic estimator of this gradient and apply SGD. Instead, we consider a biased estimator of $\nabla F(x)$ using one sample $\xi$ and $m$ i.i.d. samples $\{\eta_j\}_{j=1}^m$ from the conditional distribution of $\eta$ given $\xi$, in the following form:

$$\hat\nabla F\big(x;\xi,\{\eta_j\}_{j=1}^m\big) := \Big(\frac{1}{m}\sum_{j=1}^m \nabla g_{\eta_j}(x,\xi)\Big)^{\top}\, \nabla f_{\xi}\Big(\frac{1}{m}\sum_{j=1}^m g_{\eta_j}(x,\xi)\Big).$$

Note that $\hat\nabla F(x;\xi,\{\eta_j\}_{j=1}^m)$ is essentially the gradient of the empirical objective $f_\xi\big(\frac{1}{m}\sum_{j=1}^m g_{\eta_j}(x,\xi)\big)$. Based on this biased gradient estimator, we propose BSGD, which is formally described in Algorithm 1. When using a fixed inner mini-batch size $m_t \equiv m$, BSGD can be treated as classical SGD on the approximation function $\mathbb{E}_{\xi,\{\eta_j\}_{j=1}^m} f_\xi\big(\frac{1}{m}\sum_{j=1}^m g_{\eta_j}(x,\xi)\big)$. Throughout the paper, we make the following assumptions:

**Assumption 1.** We assume that

• (a) $\mathcal{X}$ is convex;

• (b) there exists $M > 0$ such that $\mathbb{E}\,\big\|\hat\nabla F(x;\xi,\{\eta_j\}_{j=1}^m)\big\|_2^2 \le M^2$ for any $x\in\mathcal{X}$;

• (c) $\mathbb{E}_{\eta|\xi}\,\big\|g_\eta(x,\xi) - \mathbb{E}_{\eta|\xi}\, g_\eta(x,\xi)\big\|_2^2 \le \sigma_g^2$ for any $x\in\mathcal{X}$ and $\xi$;

• (d) $f_\xi(\cdot)$ is $L_f$-Lipschitz continuous for any $\xi$.

These assumptions are widely used in the convergence analysis of stochastic gradient-based algorithms; condition (b) corresponds to the bounded second moment of the gradient estimator. We use the following assumption to characterize the (non)convexity of the objectives.

**Assumption 2.** $f_\xi\big(\mathbb{E}_{\eta|\xi}\, g_\eta(x,\xi)\big)$ and $f_\xi\big(\frac{1}{m}\sum_{j=1}^m g_{\eta_j}(x,\xi)\big)$ are $\mu$-convex in $x$ for any $\xi$ and $\{\eta_j\}_{j=1}^m$.

Strong convexity, namely $\mu > 0$, can be achieved by adding $\ell_2$-regularization to convex objectives. On the other end of the spectrum, weak convexity, namely $\mu < 0$, is a commonly used assumption in the nonconvex optimization literature (Paquette et al., 2017; Chen et al., 2019; Davis and Drusvyatskiy, 2019). It is satisfied by a variety of objectives used in machine learning, including some of those involving neural networks.
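Putting the projection step and the plug-in gradient estimator together, BSGD admits a very short implementation. The sketch below is our own minimal illustration (the projection set, the toy instantiation of $f_\xi$ and $g_\eta$, and all constants are illustrative choices; Algorithm 1 in the paper is the authoritative statement):

```python
import numpy as np

rng = np.random.default_rng(1)

def project(x, radius=10.0):
    # Euclidean projection onto the ball {||x|| <= radius}, our choice of X
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

def bsgd(grad_f, g, grad_g, x0, T, m, stepsize, sample_xi, sample_eta):
    """At step t: draw one xi and m i.i.d. eta's from eta | xi, then take a
    projected step along the plug-in (biased) gradient
    (mean_j grad_g(x, xi, eta_j))^T grad_f(mean_j g(x, xi, eta_j))."""
    x = np.array(x0, dtype=float)
    for t in range(1, T + 1):
        xi = sample_xi()
        etas = [sample_eta(xi) for _ in range(m)]
        g_bar = np.mean([g(x, xi, e) for e in etas], axis=0)
        jac_bar = np.mean([grad_g(x, xi, e) for e in etas], axis=0)
        x = project(x - stepsize(t) * jac_bar.T @ grad_f(g_bar, xi))
    return x

# Toy CSO instance: f_xi(y) = 0.5 * ||y||^2 and g_eta(x, xi) = x - xi + eta
# with scalar xi, eta ~ N(0, 1); then F(x) = 0.5 * E||x - xi||^2 + const is
# minimized at x = 0, which the iterates approach.
x_hat = bsgd(
    grad_f=lambda y, xi: y,
    g=lambda x, xi, e: x - xi + e,
    grad_g=lambda x, xi, e: np.eye(len(x)),
    x0=5.0 * np.ones(2), T=500, m=8,
    stepsize=lambda t: 0.5 / np.sqrt(t),
    sample_xi=lambda: rng.normal(),
    sample_eta=lambda xi: rng.normal(),
)
print(x_hat)
```

In this toy instance the minimizer is the origin, and the iterates converge to a small neighborhood of it whose size is governed by the stepsize and the inner batch size $m$.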

Before presenting the main results, we make another observation: the bias of the function value estimator of $F(x)$, induced by the composition of functions, depends on the smoothness of the outer function $f_\xi$. Notice that for $S$-Lipschitz smooth $f_\xi$, we have

$$\mathbb{E}\big[f_\xi(Y) - f_\xi(\mathbb{E}Y)\big] \le \frac{S}{2}\,\mathbb{E}\|Y - \mathbb{E}Y\|_2^2;$$

and for $L_f$-Lipschitz continuous $f_\xi$, we have

$$\mathbb{E}\big[f_\xi(Y) - f_\xi(\mathbb{E}Y)\big] \le L_f\,\mathbb{E}\|Y - \mathbb{E}Y\|_2,$$

where $Y$ is a random vector. The following lemma formally characterizes the estimation bias.

**Lemma 2** (Hu et al., 2019). Under Assumption 1, for a sample $\xi$ and $m$ i.i.d. samples $\{\eta_j\}_{j=1}^m$ from the conditional distribution of $\eta$ given $\xi$, and any $x$ that is independent of $\xi$ and $\{\eta_j\}_{j=1}^m$, we have

$$\Big|\mathbb{E}_{\xi,\{\eta_j\}_{j=1}^m}\, \hat F\big(x;\xi,\{\eta_j\}_{j=1}^m\big) - F(x)\Big| \le \frac{L_f\sigma_g}{\sqrt m},$$

where $\hat F(x;\xi,\{\eta_j\}_{j=1}^m) := f_\xi\big(\frac{1}{m}\sum_{j=1}^m g_{\eta_j}(x,\xi)\big)$. If, additionally, $f_\xi$ is $S$-Lipschitz smooth, we have

$$\Big|\mathbb{E}_{\xi,\{\eta_j\}_{j=1}^m}\, \hat F\big(x;\xi,\{\eta_j\}_{j=1}^m\big) - F(x)\Big| \le \frac{S\sigma_g^2}{2m}.$$

Lemma 2 implies that to control the bias of the estimator up to $\epsilon$, $m = \mathcal{O}(\epsilon^{-2})$ samples are needed for Lipschitz continuous $f_\xi$, whereas only $m = \mathcal{O}(\epsilon^{-1})$ samples are needed for Lipschitz smooth $f_\xi$.
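The two regimes in the lemma are easy to verify numerically. For $Y$ the average of $m$ standard Gaussians, the smooth choice $f(y) = y^2$ has bias exactly $\sigma^2/m$, while the merely Lipschitz choice $f(y) = |y|$ has bias $\sigma\sqrt{2/(\pi m)}$. The following sketch (our own illustration, not from the paper) estimates both:

```python
import math

import numpy as np

rng = np.random.default_rng(0)

def bias(f, m, sigma=1.0, trials=200_000):
    """Estimate E[f(sample mean of m draws)] - f(E Y) for Y ~ N(0, sigma^2)."""
    y_bar = rng.normal(scale=sigma, size=(trials, m)).mean(axis=1)
    return f(y_bar).mean() - f(0.0)

# Smooth f(y) = y^2: bias is exactly sigma^2 / m       (the O(1/m) regime).
# Nonsmooth f(y) = |y|: bias is sigma * sqrt(2/(pi*m)) (only O(1/sqrt(m))).
for m in (1, 4, 16):
    print(m, bias(np.square, m), bias(np.abs, m))
```

Quadrupling $m$ divides the first bias by four but the second only by two, which is precisely why smoothness of $f_\xi$ changes the total sample complexity.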

## 3 Convergence of BSGD

In this section, we begin by analyzing the global convergence of BSGD for strongly convex and convex objectives. Then, for the general case of weakly convex objectives, we analyze convergence to an approximate stationary point. All of these results can be extended to the mini-batch setting in which BSGD uses multiple i.i.d. samples of $\xi$ at each iteration instead of one. All proofs are deferred to Appendix A.

### 3.1 Convergence for Strongly Convex Objectives

For strongly convex objectives, i.e., functions satisfying Assumption 2 with $\mu > 0$, we have the following result.

**Theorem 3.1.** Under Assumptions 1 and 2 with $\mu > 0$, with stepsizes $\gamma_t = \frac{1}{\mu t}$, the averaged output $\hat x_T = \frac{1}{T}\sum_{t=1}^T x_t$ of BSGD satisfies:

$$\mathbb{E}\Big[F(\hat x_T) - \min_{x\in\mathcal{X}} F(x)\Big] \le \frac{M^2(\log T + 1)}{\mu T} + \frac{1}{T}\sum_{t=1}^{T} \frac{2L_f\sigma_g}{\sqrt{m_t}}.$$

If, additionally, $f_\xi$ is $S$-Lipschitz smooth, we have

$$\mathbb{E}\Big[F(\hat x_T) - \min_{x\in\mathcal{X}} F(x)\Big] \le \frac{M^2(\log T + 1)}{\mu T} + \frac{1}{T}\sum_{t=1}^{T} \frac{S\sigma_g^2}{m_t}.$$

Theorem 3.1 implies that to find an $\epsilon$-optimal solution, the number of iterations is of order $\tilde{\mathcal{O}}(\epsilon^{-1})$, which agrees with the performance of SGD with the averaging scheme for strongly convex objectives (Shamir and Zhang, 2013). As for the sample complexity, one could use either fixed mini-batch sizes $m_t \equiv \mathcal{O}(\epsilon^{-2})$, or time-varying mini-batch sizes $m_t = \mathcal{O}(t^2)$ for Lipschitz continuous $f_\xi$ and $m_t = \mathcal{O}(t)$ for Lipschitz smooth $f_\xi$, to establish the following result.

**Corollary 3.1.** Under the same assumptions as Theorem 3.1, to achieve an $\epsilon$-optimal solution, the total sample complexity of BSGD is $\tilde{\mathcal{O}}(\epsilon^{-3})$. If we further assume Lipschitz smooth $f_\xi$, the sample complexity of BSGD is $\tilde{\mathcal{O}}(\epsilon^{-2})$.

The above result indicates that smoothness conditions make a difference in the total sample complexity of BSGD when solving CSO. It is worth pointing out that the sample complexity of BSGD matches that of ERM for strongly convex objectives established in Hu et al. (2019).

### 3.2 Convergence for General Convex Objectives

We now turn to the convergence for convex objectives, i.e., those satisfying Assumption 2 with $\mu = 0$. Denote $D := \min_{x^*\in\mathcal{X}^*}\|x_1 - x^*\|_2$, where $\mathcal{X}^*$ refers to the optimal solution set of $F$ over $\mathcal{X}$ and $x_1$ is the initial point used in BSGD.

**Theorem 3.2.** Under Assumptions 1 and 2 with $\mu = 0$, with stepsizes $\gamma_t = \frac{c}{\sqrt T}$ for a positive constant $c$, the averaged output $\hat x_T = \frac{1}{T}\sum_{t=1}^T x_t$ of BSGD satisfies

$$\mathbb{E}\Big[F(\hat x_T) - \min_{x\in\mathcal{X}} F(x)\Big] \le \frac{M^2c^2 + D^2}{2c\sqrt T} + \frac{1}{T}\sum_{t=1}^{T} \frac{2L_f\sigma_g}{\sqrt{m_t}}.$$

Furthermore, if $f_\xi$ is $S$-Lipschitz smooth, we have

$$\mathbb{E}\Big[F(\hat x_T) - \min_{x\in\mathcal{X}} F(x)\Big] \le \frac{M^2c^2 + D^2}{2c\sqrt T} + \frac{1}{T}\sum_{t=1}^{T} \frac{S\sigma_g^2}{m_t}.$$

Compared with Theorem 3.1, without the strong convexity condition, the iteration complexity increases from $\tilde{\mathcal{O}}(\epsilon^{-1})$ to $\mathcal{O}(\epsilon^{-2})$. Similarly, by setting fixed mini-batch sizes $m_t \equiv \mathcal{O}(\epsilon^{-2})$, or time-varying sizes $m_t = \mathcal{O}(t)$ for Lipschitz continuous $f_\xi$ and $m_t = \mathcal{O}(\sqrt t)$ for Lipschitz smooth $f_\xi$, we can obtain the following result.

**Corollary 3.2.** Under the same assumptions as Theorem 3.2, to achieve an $\epsilon$-optimal solution, the total sample complexity required by BSGD is $\mathcal{O}(\epsilon^{-4})$. If we further assume Lipschitz smooth $f_\xi$, the sample complexity required by BSGD becomes $\mathcal{O}(\epsilon^{-3})$.

### 3.3 Stationary Convergence for Nonconvex Objectives

In this subsection, we analyze the convergence of BSGD to an approximate stationary point when the objective $F$ is weakly convex. We first define the Moreau envelope $F_\lambda$ (with parameter $\lambda > 0$) of the function $F$ and its corresponding minimizer:

$$F_\lambda(x) := \min_z \Big\{F(z) + \frac{1}{2\lambda}\|z - x\|_2^2\Big\}, \qquad \operatorname{prox}_{\lambda F}(x) := \operatorname*{arg\,min}_z \Big\{F(z) + \frac{1}{2\lambda}\|z - x\|_2^2\Big\}.$$

Based on the Moreau envelope, we define the gradient mapping:

$$\mathcal{G}_\lambda(x) := \frac{1}{\lambda}\big\|\operatorname{prox}_{\lambda F}(x) - x\big\|_2.$$
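For intuition, both quantities admit a closed form when $F(x) = |x|$: the proximal point is the soft-thresholding operator, and the gradient mapping equals $1$ away from the minimizer and $0$ at it. A small sketch (our own illustration, not from the paper):

```python
import numpy as np

def prox_abs(x, lam):
    # prox_{lam * F}(x) for F = |.| is soft-thresholding: the argmin in the
    # Moreau envelope shrinks x toward 0 by lam.
    return np.sign(x) * max(abs(x) - lam, 0.0)

def grad_mapping(x, lam):
    # G_lam(x) = (1/lam) * |prox_{lam F}(x) - x| for F(x) = |x|
    return abs(prox_abs(x, lam) - x) / lam

# Away from the minimizer the mapping equals |F'(x)| = 1; at the minimizer
# x = 0 it vanishes, so a small G_lam certifies near-stationarity.
print(grad_mapping(3.0, 0.5), grad_mapping(0.0, 0.5))  # -> 1.0 0.0
```

This is why $\mathcal{G}_\lambda$ is a sensible stationarity measure even when $F$ is nondifferentiable.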

We call $\hat x$, generated by a randomized algorithm, an $\epsilon$-stationary point of $F$ if $\mathbb{E}\,\mathcal{G}_\lambda(\hat x) \le \epsilon$, where the expectation is taken with respect to the randomness of the algorithm. This convergence criterion is commonly used in the nonconvex optimization literature (Beck, 2017; Drusvyatskiy, 2017). We have the following result.

**Theorem 3.3.** Under Assumptions 1 and 2 with $\mu < 0$, for a given total iteration number $T$, let the stepsizes be $\gamma_t = \frac{c}{\sqrt T}$ for a positive constant $c$ and the inner mini-batch sizes be $m_t \equiv m$ for some integer $m$. The output $\hat x_T$ of BSGD, selected uniformly at random from $\{x_1,\ldots,x_T\}$, satisfies

$$\mathbb{E}\big[\mathcal{G}^2_{1/(2|\mu|)}(\hat x_T)\big] \le \frac{2\big(F_{1/(2|\mu|)}(x_1) - \min_{x\in\mathcal{X}} F(x)\big) + 2|\mu| c^2 M^2}{c\sqrt T} + \frac{8|\mu| L_f\sigma_g}{\sqrt m}.$$

If we further assume $f_\xi$ is $S$-Lipschitz smooth, we have

$$\mathbb{E}\big[\mathcal{G}^2_{1/(2|\mu|)}(\hat x_T)\big] \le \frac{2\big(F_{1/(2|\mu|)}(x_1) - \min_{x\in\mathcal{X}} F(x)\big) + 2|\mu| c^2 M^2}{c\sqrt T} + \frac{4|\mu| S\sigma_g^2}{m}.$$

To the best of our knowledge, this is the first non-asymptotic convergence guarantee for CSO in the nonconvex setting.

**Corollary 3.3.** Under the same assumptions as Theorem 3.3, to achieve an $\epsilon$-stationary point, the total sample complexity required by BSGD is at most $\mathcal{O}(\epsilon^{-8})$. If we further assume Lipschitz smooth $f_\xi$, the sample complexity is at most $\mathcal{O}(\epsilon^{-6})$.

Specifically, the total sample complexity is obtained by setting the number of iterations $T = \mathcal{O}(\epsilon^{-4})$ and the inner batch size $m = \mathcal{O}(\epsilon^{-4})$, or $m = \mathcal{O}(\epsilon^{-2})$ in the smooth case. Note that even when we use an infinite number of i.i.d. samples from the conditional distribution of $\eta$ given $\xi$, namely $m = \infty$, to construct an unbiased gradient estimator at each iteration, BSGD (which in this case reduces to SGD) would still need $\mathcal{O}(\epsilon^{-4})$ iterations to obtain an $\epsilon$-stationary point in general. This is also known to be the optimal sample complexity for finding an $\epsilon$-stationary point using unbiased first-order algorithms on weakly convex objectives (Arjevani et al., 2019; Drori and Shamir, 2019). Thus, the iteration complexity of BSGD is optimal.
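The accounting behind these totals is just the product of the iteration count and the per-iteration inner batch size; the snippet below (constants dropped, our own illustration) makes the arithmetic explicit:

```python
# Sample-complexity accounting for the weakly convex case, constants dropped:
# T = eps^-4 iterations, each drawing one outer sample and m inner samples.
for eps in (0.1, 0.01):
    T = eps ** -4
    m_lip = eps ** -4     # Lipschitz-continuous outer function f_xi
    m_smooth = eps ** -2  # Lipschitz-smooth outer function f_xi
    print(eps, T * m_lip, T * m_smooth)  # eps^-8 and eps^-6 totals
```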

We also provide a convergence guarantee using decaying stepsizes.

**Corollary 3.4** (Decaying Stepsizes). Let the total iteration number be $T$, the inner batch sizes be $m_t \equiv m$, and the stepsizes be $\gamma_t = \frac{c}{\sqrt t}$ with $c > 0$. If the output $\hat x_R$ is chosen from $\{x_1,\ldots,x_T\}$ with $\mathbb{P}(R = t) = \gamma_t / \sum_{t=1}^{T}\gamma_t$, we have

$$\mathbb{E}\big[\mathcal{G}^2_{1/(2|\mu|)}(\hat x_R)\big] \le \frac{2F_{1/(2|\mu|)}(x_1) - 2\min_{x\in\mathcal{X}} F(x) + 4|\mu| c^2 M^2 \ln T}{c\sqrt T} + \frac{8|\mu| L_f\sigma_g}{\sqrt m}; \tag{5}$$

if we further assume $f_\xi$ is $S$-Lipschitz smooth, we have

$$\mathbb{E}\big[\mathcal{G}^2_{1/(2|\mu|)}(\hat x_R)\big] \le \frac{2F_{1/(2|\mu|)}(x_1) - 2\min_{x\in\mathcal{X}} F(x) + 4|\mu| c^2 M^2 \ln T}{c\sqrt T} + \frac{4|\mu| S\sigma_g^2}{m}. \tag{6}$$

## 4 Lower Bounds of BSGD in Convex Setting

In this section, we discuss lower bounds on the expected error of BSGD using the well-known oracle model. The oracle model consists of three components: a function class of interest $\mathcal{F}$, an algorithm class $\mathcal{A}$, and an oracle class $\Phi$, in which an oracle $\phi$ takes a query point from an algorithm about a function $F$ and reveals back to the algorithm some information about $F$. The function class of interest, the algorithm class, and the detailed proofs are deferred to Section B in the Appendix.

**Definition 1** (Biased Stochastic Function Value Oracle). A biased stochastic function value oracle $\phi$ has two parameters, $V$ and $B$. For a query at point $x$ of a function $F$ given by an algorithm, $\phi$ takes a sample $\xi$ from a distribution $\mathcal{P}$ and returns to the algorithm a pair $\big(\hat F(x,\xi), \hat g(x,\xi)\big)$, which satisfies the following conditions:

• $\big|\mathbb{E}_\xi\, \hat F(x,\xi) - F(x)\big| \le B$,

• $\mathbb{E}_\xi\big|\hat F(x,\xi) - \mathbb{E}_\xi \hat F(x,\xi)\big|^2 \le V$,

• $\mathbb{E}_\xi\big\|\hat g(x,\xi) - \mathbb{E}_\xi \hat g(x,\xi)\big\|_2^2 \le V$.

We use the minimax error as the complexity measure:

$$\Delta^*_T(\mathcal{F}) := \inf_{A\in\mathcal{A}}\, \sup_{F\in\mathcal{F}}\, \sup_{\phi\in\Phi}\, \mathbb{E}\,\Delta^A_T(F,\phi,\mathcal{X}), \tag{7}$$

where $\Delta^A_T(F,\phi,\mathcal{X})$ is the optimization error of a randomized algorithm $A$ for optimizing a function $F$ on a bounded convex set $\mathcal{X}$ via querying the oracle $\phi$. If $\Delta^*_T(\mathcal{F}) \ge \epsilon$, it implies that for any algorithm $A$, there exists a 'hard' function in $\mathcal{F}$ and an oracle $\phi$ such that the expected error incurred by $A$ when optimizing this function by querying $\phi$ for $T$ iterations is at least $\epsilon$.

**Theorem 4.** For any algorithm $A$ that queries the biased stochastic function value oracle $T$ times, the minimax error satisfies the following bounds:

• for the class $\mathcal{F}^1_{L,0}$ of one-dimensional convex and $L$-Lipschitz continuous functions, we have

$$\Delta^*_T\big(\mathcal{F}^1_{L,0}\big) \ge \frac{B}{4} + \frac{1}{4}\sqrt{\frac{V}{T}}; \tag{8}$$

• for the class $\mathcal{F}^1_{L,1}$ of one-dimensional 1-strongly convex and $L$-Lipschitz continuous functions, we have

$$\Delta^*_T\big(\mathcal{F}^1_{L,1}\big) \ge \frac{1}{4}\Big(B + \sqrt{\frac{V}{T}}\Big)^2. \tag{9}$$

A general version of Theorem 4 for any dimension is provided in Appendix B. In the special case when $B = 0$, namely when the oracle returns to the algorithm an unbiased function estimator and its gradient, Theorem 4 reduces to the classical lower bounds on the oracle complexity of stochastic convex optimization. The proof follows ideas from Hu et al. (2016) and is deferred to Appendix B. Invoking Lemma 2 and Theorem 4, we obtain the following lower bounds for BSGD.

**Corollary 4.** When minimizing a CSO objective using BSGD with a fixed inner mini-batch size $m$ and any stepsizes for $T$ iterations, there always exist hard instances such that

• for convex and $L_f$-Lipschitz continuous $f_\xi$,

$$F(\hat x_T) - \min_{x\in\mathcal{X}} F(x) \ge \Omega\Big(\frac{L_f\sigma_g}{\sqrt m} + \sqrt{\frac{V}{T}}\Big);$$

• for convex and $S$-Lipschitz smooth $f_\xi$,

$$F(\hat x_T) - \min_{x\in\mathcal{X}} F(x) \ge \Omega\Big(\frac{S\sigma_g^2}{m} + \sqrt{\frac{V}{T}}\Big);$$

• for 1-strongly convex and $L_f$-Lipschitz continuous $f_\xi$,

$$F(\hat x_T) - \min_{x\in\mathcal{X}} F(x) \ge \Omega\Big(\frac{L_f^2\sigma_g^2}{m} + \frac{V}{T}\Big);$$

• for 1-strongly convex and $S$-Lipschitz smooth $f_\xi$,

$$F(\hat x_T) - \min_{x\in\mathcal{X}} F(x) \ge \Omega\Big(\frac{S\sigma_g^4}{m^2} + \frac{V}{T}\Big).$$

Compared with Corollary 3.2, Corollary 4 implies that the sample complexity of BSGD for convex objectives is optimal and cannot be further improved. For strongly convex objectives, there remains a gap between the lower bound and the upper bound. It remains interesting to explore whether the lower bounds can be further improved, which we leave for future investigation.

## 5 Applications and Numerical Experiments

In this section, we demonstrate the numerical performance of BSGD on invariant logistic regression, model-agnostic meta-learning, and instrumental variable regression. The parameter settings and detailed experiment results are deferred to Appendix C. The platform configuration used for the experiment is Intel Core i9-7940X CPU @ 3.10GHz, 32GB RAM, 64-bit Ubuntu 18.04.3 LTS.

### 5.1 Invariant Logistic Regression

Invariant learning has wide applications in machine learning and related areas (Mroueh et al., 2015; Anselmi et al., 2016). We consider a simple invariant logistic regression problem with the following formulation:

$$\min_{x}\ \mathbb{E}_{\xi=(a,b)}\Big[\log\Big(1 + \exp\big(-b\,\mathbb{E}_{\eta|\xi}[\eta]^{\top} x\big)\Big)\Big], \tag{10}$$

where $x$ is the model parameter to be trained, $a$ is the random feature vector, $b \in \{-1, +1\}$ is the corresponding label, and $\eta$ is a randomly perturbed observation of the feature $a$.

We generate a synthetic dataset and consider three different variances of the perturbation $\eta$, corresponding to different perturbation levels. At each iteration, we use a fixed inner mini-batch size $m$, namely $m$ samples of $\eta$ are generated for a given feature-label pair $(a, b)$. We fine-tune the stepsizes for BSGD using grid search. We compare the performance with the SAA approach (Hu et al., 2019) solved by CVXPY as a benchmark.

For a given total sampling budget, the performance of BSGD with different inner batch sizes $m$ under different perturbation levels is summarized in Figure 1. As the perturbation level increases, larger inner batch sizes are needed to control the bias incurred by the biased gradient estimator of BSGD. This numerical observation supports our theoretical finding of a delicate trade-off between the inner batch size and the number of iterations. We also compare the performance achieved by BSGD and SAA, in terms of the optimality gap, by selecting the best inner batch size for a given budget. The results are summarized in Table 2; detailed results can be found in Table 5 in Appendix C.1. We observe that, given the same budget of samples, BSGD outperforms SAA and requires a much smaller inner batch size in practice.

### 5.2 Model-Agnostic Meta-Learning (MAML)

MAML has received tremendous popularity in the past years for few-shot supervised learning and meta reinforcement learning. Recall the formulation of MAML in (2). It is a special case of CSO (1) using the following mappings:

$$f_\xi(w) = l_i\big(w, D_i^{\text{query}}\big),\ \text{where } \xi = \big(i, D_i^{\text{query}}\big); \qquad g_\eta(w,\xi) = w - \alpha\nabla l_i\big(w, D_i^{\text{support}}\big),\ \text{where } \eta = D_i^{\text{support}}.$$

Remark (Convergence of BSGD for MAML). If the objective of MAML in (2) is weakly convex and the loss $l_i$ is Lipschitz smooth in $w$ for any task $i$, namely $g_\eta$ is Lipschitz continuous and $f_\xi$ is Lipschitz smooth, then Corollary 3.3 implies that BSGD converges to an $\epsilon$-stationary point of (2) with sample complexity $\mathcal{O}(\epsilon^{-6})$. This complexity, as mentioned in the introduction, matches the sample complexity of a recent algorithm for MAML discussed in Fallah et al. (2019), which additionally requires stochastic stepsizes and mini-batches of tasks (i.e., outer samples) at each iteration.
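The Hessian term that FO-MAML drops can matter even in one dimension. For a single deterministic task with loss $l(w) = \frac12 a w^2$, the inner update is $w' = (1-\alpha a)w$, so the exact MAML gradient carries a Jacobian factor $(1-\alpha a)$ that FO-MAML discards. The sketch below (constants chosen deliberately so that $\alpha a > 1$; an illustrative corner case of ours, not the paper's experiment) shows the two gradients can even disagree in sign:

```python
# One deterministic 1-D task with loss l(w) = 0.5 * a * w^2. The inner update
# is w' = w - alpha * a * w = (1 - alpha * a) * w, so the MAML objective is
# 0.5 * a * (1 - alpha * a)^2 * w^2; its exact gradient carries the Jacobian
# factor (1 - alpha * a) that FO-MAML discards.
alpha, a, w = 0.6, 2.0, 1.0   # alpha * a > 1, an adversarial choice

w_inner = (1 - alpha * a) * w
exact_grad = (1 - alpha * a) * (a * w_inner)  # chain rule with the Jacobian
fo_grad = a * w_inner                         # FO-MAML approximation

print(exact_grad, fo_grad)  # opposite signs: FO-MAML ascends the objective
```

When $\alpha a < 1$ the two gradients agree in sign and FO-MAML is only biased; in regimes like the one above, however, no inner mini-batch size can rescue it, consistent with its lack of convergence guarantees.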

We take the widely used sine-wave few-shot regression task as an illustration. The goal is to train a model such that it can recover a new (unseen) sine-wave signal from only a few available data points. The sine wave is of the form $y = A\sin(x + \varphi)$, where the amplitude $A$ and phase $\varphi$ are undetermined parameters drawn from fixed ranges. In this experiment, the model is a fully connected neural network with ReLU activations between the hidden layers.

For a fixed total number of samples, we select the inner batch size and compare the performance of BSGD with first-order MAML (FO-MAML) (Finn et al., 2017; Fallah et al., 2019), which ignores the Hessian information when constructing the gradient estimator. Here we use the objective value as the performance measure. Since the objective is analytically intractable, we evaluate the MAML objective via the empirical objective:

$$\hat F(w) = \frac{1}{\hat T}\sum_{i=1}^{\hat T}\frac{1}{\hat N}\sum_{n=1}^{\hat N}\, l_i\Big(w - \alpha\cdot\frac{1}{\hat M}\sum_{m=1}^{\hat M}\nabla_w l_i\big(w, D_i^{\text{support},m}\big);\ D_i^{\text{query},n}\Big), \tag{11}$$

where the three sample sizes $\hat T$, $\hat N$, and $\hat M$ are all set to 100; when computing the approximate loss value, the sample tasks and data are selected randomly. We fine-tune the stepsizes for both BSGD and FO-MAML. We also compare with Adam, which is used in the original MAML paper (Finn et al., 2017); for Adam, we use its default parameter settings in PyTorch for training.

Experiment results are summarized in Figure 2. The average loss and running time (in CPU minutes) over trials of each algorithm (under its best inner batch size) are given in Table 3. More detailed results are provided in Table 6 in Section C.2 of the Appendix. Figure 2(a) demonstrates the tradeoff between the inner batch size and the number of iterations for BSGD. Figure 2(b) compares the convergence of the three algorithms with their best-tuned inner batch sizes; the red line demonstrates a divergent case of FO-MAML. Although FO-MAML requires the least running time, as shown in Table 3, its performance is worse than that of BSGD. BSGD achieves the least error among the three methods with a properly chosen inner batch size. BSGD is slightly faster than Adam and requires a smaller mini-batch size to achieve its best performance, which is more practical when some tasks only have a small number of samples. We also use the meta-initialization parameters obtained by the different methods as the initial model parameters to train on an unseen sine-wave regression task. Figure 2(c) shows the signals recovered after a one-step update on the unseen task with only a few samples. Here Random NN refers to a neural network with the same structure using a random initialization. We observe that BSGD, FO-MAML, and Adam recover most of the shape of the signal, while Random NN fails to work without the meta-initialization parameter.

### 5.3 Instrumental Variable Regression (IV)

IV regression is a fundamental task in econometrics, health care, and social science. It aims to estimate the causal effect of an input $X$ on a confounded outcome $Y$ with an instrument variable $Z$ that affects $Y$ only through $X$, i.e., $Z$ is conditionally independent of $Y$ given $X$ and the confounder. Since an unobservable confounder influences both $X$ and $Y$, classical regression methods might fail. Recently, Muandet et al. (2019) showed that, instead of using the traditional two-stage procedure (Angrist and Pischke, 2008; Hartford et al., 2017), one could resort to solving an optimization problem formulated as:

$$\min_{w}\ \mathbb{E}_{Y,Z}\, L\big(Y,\ \mathbb{E}_{X|Z}[h(w, X)]\big), \tag{12}$$

where $h$ is a function parameterized by $w$ that maps $X$ to the outcome space, and $L$ is the squared loss function.

Similar to Lewis and Syrgkanis (2018); Bennett et al. (2019), we generate data via the following process:

$$Z\sim \mathrm{Uniform}\big([-3,3]^2\big);\quad e\sim\mathcal{N}(0,1),\ \ \gamma,\delta\sim\mathcal{N}(0,0.1);\quad X = 0.5Z_1 + 0.5e + \gamma;\quad Y = g(X) + e + \delta,$$

where $e$ is the confounder and $\gamma, \delta$ are random noise. We consider four cases with different ground truth functions: (a) a sine function $g(x) = \sin(x)$, (b) a step function $g(x) = \mathbf{1}\{x \ge 0\}$, (c) an absolute value function $g(x) = |x|$, and (d) a linear function $g(x) = x$. We compare the performance of BSGD, Random NN, the classical 2SLS (Angrist and Imbens, 1995), and Poly2SLS. Random NN predicts $Y$ directly from $X$ using a neural network. Poly2SLS expands $X$ and $Z$ via polynomial features and performs 2SLS via ridge regression. The neural networks used in BSGD and Random NN have the same structure, with ReLU activations between the hidden layers. For Random NN, we use classical SGD to perform the updates.
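The data-generating process above and a bare-bones 2SLS baseline can be sketched as follows (we simplify to a scalar instrument and pick the sample size ourselves; the released implementations referenced below are the authoritative baselines):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(g, n=2000):
    """Confounded data: e enters both X and Y, so OLS of Y on X is biased,
    while Z is a valid instrument (correlated with X, independent of e)."""
    Z = rng.uniform(-3, 3, size=n)
    e = rng.normal(size=n)                          # unobserved confounder
    gamma = rng.normal(scale=np.sqrt(0.1), size=n)  # N(0, 0.1) noise
    delta = rng.normal(scale=np.sqrt(0.1), size=n)
    X = 0.5 * Z + 0.5 * e + gamma
    Y = g(X) + e + delta
    return Z, X, Y

def two_sls(Z, X, Y):
    """Two-stage least squares: regress X on Z, then Y on the fitted X."""
    A = np.column_stack([np.ones_like(Z), Z])
    X_hat = A @ np.linalg.lstsq(A, X, rcond=None)[0]          # stage 1
    B = np.column_stack([np.ones_like(X_hat), X_hat])
    intercept, slope = np.linalg.lstsq(B, Y, rcond=None)[0]   # stage 2
    return intercept, slope

Z, X, Y = make_data(lambda x: x)   # linear ground truth g(x) = x
print(two_sls(Z, X, Y))            # slope recovers the unit slope closely
```

On the linear case 2SLS is consistent, which is why it wins that case in Table 4; for the nonlinear ground truths its linearity assumption is violated, and the CSO formulation (12) solved by BSGD takes over.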

Figure 4 demonstrates the convergence of BSGD under different inner mini-batch sizes (five runs for each case) on data generated by the different ground truth functions. Figure 3 compares the recovered signals using BSGD, Random NN, 2SLS, and Poly2SLS. For BSGD and Random NN, we fine-tune the stepsize, and we select the best inner mini-batch size for BSGD. For 2SLS and Poly2SLS, we use the implementation released in Bennett et al. (2019). Table 4 compares the mean squared error, over the ground truth function, of the estimators obtained by the four algorithms on a held-out test set generated by the process stated above. We run each algorithm ten times and report the mean and standard deviation of the mean squared error.

The results show that 2SLS outperforms all other methods in the linear ground-truth case, since the underlying assumption of 2SLS is linearity. BSGD achieves the smallest mean squared error in the other three cases.

## 6 Conclusion

We propose a biased first-order algorithm, BSGD, for solving CSO problems. We analyze upper bounds on the sample complexity of BSGD under various structural assumptions. We establish the first non-asymptotic convergence result to an approximate stationary point for nonconvex CSO, which also provides a theoretical guarantee for MAML. We show that the sample complexity of BSGD depends heavily on the (non)convexity condition, which influences the iteration complexity, and on the smoothness condition, which affects the estimation bias incurred by the biased gradient estimator. We also provide a matching lower bound for BSGD on convex objectives. As for strongly convex and weakly convex objectives, we conjecture that BSGD achieves the optimal sample complexities, whereas constructing matching lower bounds remains an interesting open problem.

## Appendix A Convergence Analysis

In this section, we present the proofs of Theorems 3.1, 3.2, and 3.3. We also prove Corollary 3.1; the proofs of Corollaries 3.2 and 3.3 are similar. First, we describe the proof framework for strongly convex and convex objectives.

### a.1 Proof Framework for Strongly Convex and Convex Objectives

Recall that in Algorithm 1, at iteration $t$, BSGD first generates a sample $\xi_t$ from the distribution of $\xi$ and $m_t$ samples $\{\eta_{t,j}\}_{j=1}^{m_t}$ from the conditional distribution of $\eta$ given $\xi_t$. We define the following auxiliary functions to facilitate our analysis:

$$p(x,\xi_t) := f_{\xi_t}\big(\mathbb{E}_{\eta|\xi_t}\, g_\eta(x,\xi_t)\big); \qquad \hat p(x,\xi_t) := f_{\xi_t}\Big(\frac{1}{m_t}\sum_{j=1}^{m_t} g_{\eta_{t,j}}(x,\xi_t)\Big).$$

Note that $F(x) = \mathbb{E}_{\xi_t}\, p(x,\xi_t)$. The biased gradient estimator used in BSGD is $\nabla_x\hat p(x_t,\xi_t)$. Denote $x^* \in \arg\min_{x\in\mathcal{X}} F(x)$, $A_t := \frac{1}{2}\|x_t - x^*\|_2^2$, and $a_t := \mathbb{E}\, A_t$. Since $\mathcal{X}$ is convex and the projection operator is non-expansive, we have

$$A_{t+1} = \frac{1}{2}\big\|x_{t+1} - x^*\big\|_2^2 = \frac{1}{2}\Big\|\Pi_{\mathcal{X}}\big(x_t - \gamma_t\nabla_x\hat p(x_t,\xi_t)\big) - \Pi_{\mathcal{X}}(x^*)\Big\|_2^2 \le \frac{1}{2}\big\|x_t - x^* - \gamma_t\nabla_x\hat p(x_t,\xi_t)\big\|_2^2 = A_t + \frac{\gamma_t^2}{2}\big\|\nabla_x\hat p(x_t,\xi_t)\big\|_2^2 - \gamma_t\nabla_x\hat p(x_t,\xi_t)^{\top}(x_t - x^*). \tag{13}$$

Dividing both sides by $\gamma_t$ and taking expectations, we have

$$\mathbb{E}\,\nabla_x\hat p(x_t,\xi_t)^{\top}(x_t - x^*) \le \frac{a_t - a_{t+1}}{\gamma_t} + \frac{\gamma_t}{2}\,\mathbb{E}\big\|\nabla_x\hat p(x_t,\xi_t)\big\|_2^2. \tag{14}$$

By Assumption 2, we have

$$-\nabla_x\hat p(x_t,\xi_t)^{\top}(x_t - x^*) \le \hat p(x^*,\xi_t) - \hat p(x_t,\xi_t) - \frac{\mu}{2}\|x_t - x^*\|_2^2 = \underbrace{\hat p(x^*,\xi_t) - p(x^*,\xi_t)}_{=:\,\zeta_1^t} + \underbrace{p(x^*,\xi_t) - p(x_t,\xi_t)}_{=:\,\zeta_2^t} + \underbrace{p(x_t,\xi_t) - \hat p(x_t,\xi_t)}_{=:\,\zeta_3^t} - \frac{\mu}{2}\|x_t - x^*\|_2^2. \tag{15}$$

Taking expectations on both sides and using the definition of $\zeta_2^t$, which gives $\mathbb{E}\,\zeta_2^t = \mathbb{E}\big[F(x^*) - F(x_t)\big]$, we obtain

$$-\mathbb{E}\,\nabla_x\hat p(x_t,\xi_t)^{\top}(x_t - x^*) \le \mathbb{E}\,\zeta_1^t + \mathbb{E}\,\zeta_3^t + \mathbb{E}\big[F(x^*) - F(x_t)\big] - \frac{\mu}{2}\,\mathbb{E}\|x_t - x^*\|_2^2. \tag{16}$$

We also notice that $\mathbb{E}\,\zeta_1^t$ and $\mathbb{E}\,\zeta_3^t$ can be bounded by Lemma 2, as $x^*$ and $x_t$ are independent of $\xi_t$ and $\{\eta_{t,j}\}_{j=1}^{m_t}$.

Combining (14) and (16), and recalling that $a_t = \frac{1}{2}\mathbb{E}\|x_t - x^*\|_2^2$, we obtain

$$\mathbb{E}\, F(x_t) - F(x^*) \le \mathbb{E}\,\zeta_1^t + \mathbb{E}\,\zeta_3^t - \mu a_t + \frac{a_t - a_{t+1}}{\gamma_t} + \frac{\gamma_t}{2}\,\mathbb{E}\big\|\nabla_x\hat p(x_t,\xi_t)\big\|_2^2.$$

By convexity of $F$ and the definition of the averaged output $\hat x_T = \frac{1}{T}\sum_{t=1}^{T} x_t$, it holds that

$$\mathbb{E}\big[F(\hat x_T) - F(x^*)\big] \le \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[F(x_t) - F(x^*)\big].$$