Biased Stochastic Gradient Descent for Conditional Stochastic Optimization
Abstract
Conditional Stochastic Optimization (CSO) covers a variety of applications ranging from meta-learning and causal inference to invariant learning. However, constructing unbiased gradient estimates in CSO is challenging due to the composition structure. As an alternative, we propose a biased stochastic gradient descent (BSGD) algorithm and study the bias-variance tradeoff under different structural assumptions. We establish the sample complexities of BSGD for strongly convex, convex, and weakly convex objectives, under smooth and nonsmooth conditions. We also provide matching lower bounds of BSGD for convex CSO objectives. Extensive numerical experiments are conducted to illustrate the performance of BSGD on robust logistic regression, model-agnostic meta-learning (MAML), and instrumental variable regression (IV).
1 Introduction
We study a class of optimization problems, called Conditional Stochastic Optimization (CSO), of the form

(1)  $\min_{x \in \mathcal{X}} F(x) := \mathbb{E}_{\xi}\, f_{\xi}\big( \mathbb{E}_{\eta \mid \xi}\, g_{\eta}(x, \xi) \big),$

where $\mathcal{X} \subseteq \mathbb{R}^d$, $g_{\eta}(\cdot, \xi): \mathbb{R}^d \to \mathbb{R}^k$ is a vector-valued function dependent on both random vectors $\xi$ and $\eta$, $f_{\xi}: \mathbb{R}^k \to \mathbb{R}$ is a continuous function dependent on the random vector $\xi$, and the inner expectation is taken with respect to the conditional distribution of $\eta$ given $\xi$. Throughout, we only assume access to samples from the distribution of $\xi$ and from the conditional distribution of $\eta$ given $\xi$, rather than the cumulative distribution functions of these distributions.
CSO includes classical stochastic optimization as a special case when $f_{\xi}$ is an identity function, but is much more general. It has recently been utilized to solve a variety of applications in machine learning, ranging from policy evaluation and control in reinforcement learning (Dai et al., 2017, 2018; Nachum and Dai, 2020) and optimal control in linearly-solvable Markov decision processes (Dai et al., 2017), to instrumental variable regression in causal inference (Singh et al., 2019; Muandet et al., 2019).
One common challenge with the aforementioned applications is that, in the extreme case, a few or even only one sample is available from the conditional distribution of $\eta$ for each given $\xi$. To deal with this limitation, a primal-dual stochastic approximation algorithm was proposed to solve a min-max reformulation of CSO using kernel embedding techniques (Dai et al., 2017). However, this approach requires convexity of $f_{\xi}$ and linearity of $g_{\eta}$, which are not satisfied in general applications, especially when neural networks are used.
On the other hand, for many other applications, e.g., those in invariant learning and meta-learning, we do have access to multiple samples from the conditional distribution. Take model-agnostic meta-learning (MAML) (Finn et al., 2017) as an example. It belongs to a broader concept called few-shot meta-learning, which uses models obtained from previous tasks on new tasks with possibly only a small amount of data (Thrun and Pratt, 2012). MAML itself aims to learn a meta-initialization parameter using meta-data from similar learning tasks that are trained via gradient-based algorithms (Finn et al., 2017). It can be formulated into the following optimization problem:
(2)  $\min_{w} \; \mathbb{E}_{i \sim p,\, D^{i}_{\mathrm{query}}}\; l_i\Big( w - \alpha \nabla \mathbb{E}_{D^{i}_{\mathrm{support}}}\, l_i\big(w, D^{i}_{\mathrm{support}}\big),\; D^{i}_{\mathrm{query}} \Big),$

where $p$ represents the distribution of different tasks, $D^{i}_{\mathrm{support}}$ and $D^{i}_{\mathrm{query}}$ correspond to support (training) data and query (testing) data of the task $i$ to be trained on, $l_i$ is the loss function on data from task $i$, and $\alpha$ is a fixed meta step size. Setting $\xi = (i, D^{i}_{\mathrm{query}})$, $\eta = D^{i}_{\mathrm{support}}$, $g_{\eta}(w, \xi) = w - \alpha \nabla l_i(w, D^{i}_{\mathrm{support}})$, and $f_{\xi}(\cdot) = l_i(\cdot, D^{i}_{\mathrm{query}})$, (2) is clearly a special case of CSO for which multiple samples can be drawn from the conditional distribution of $\eta$ given $\xi$. Since the loss function generally involves neural networks, the resulting CSO is usually a nonconvex problem. Thus, the previous primal-dual algorithm and dual embedding techniques developed in Dai et al. (2017) simply do not apply.
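To make the reduction above concrete, the two MAML mappings can be sketched in a few lines of Python. This is an illustrative toy with scalar least-squares tasks; the helper names (`maml_inner_map`, `maml_outer_loss`) and the task distribution are our own assumptions, not the setup used in the experiments.

```python
import numpy as np

def maml_inner_map(w, alpha, grad_loss, support):
    """g_eta(w, xi): one gradient step on the support data of a sampled task."""
    return w - alpha * grad_loss(w, support)

def maml_outer_loss(w_adapted, loss, query):
    """f_xi: loss of the adapted parameter on the query data of the same task."""
    return loss(w_adapted, query)

# Toy task: scalar least squares, l(w, D) = mean((w - D)^2), task optimum near 1.
loss = lambda w, D: np.mean((w - D) ** 2)
grad_loss = lambda w, D: np.mean(2.0 * (w - D))

rng = np.random.default_rng(0)
support = rng.normal(1.0, 0.1, size=5)   # D_support for one task
query = rng.normal(1.0, 0.1, size=5)     # D_query for the same task

w = 0.0                                   # meta-initialization
w_adapted = maml_inner_map(w, 0.4, grad_loss, support)
outer = maml_outer_loss(w_adapted, loss, query)
```

One step of the inner map already moves the parameter toward the task optimum, and the outer loss evaluates that adapted parameter on held-out query data, which is exactly the composition structure of (1).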
Table 1: Total sample complexity of SAA (Hu et al., 2019) and BSGD under different structural assumptions. Goal: to find an $\epsilon$-optimal solution for convex $F$ and an $\epsilon$-stationary point for weakly convex $F$. "Smooth" and "Lipschitz" refer to Lipschitz smoothness and Lipschitz continuity of the outer function $f_{\xi}$, respectively.

| Algorithm | Strongly Convex, Smooth | Strongly Convex, Lipschitz | Convex, Smooth | Convex, Lipschitz | Weakly Convex, Smooth | Weakly Convex, Lipschitz |
| SAA (Hu et al., 2019) | $\tilde{O}(\epsilon^{-2})$ | $\tilde{O}(\epsilon^{-3})$ | $\tilde{O}(d\epsilon^{-3})$ | $\tilde{O}(d\epsilon^{-4})$ | n/a | n/a |
| BSGD (this paper) | $\tilde{O}(\epsilon^{-2})$ | $O(\epsilon^{-3})$ | $O(\epsilon^{-3})$ | $O(\epsilon^{-4})$ | $O(\epsilon^{-6})$ | $O(\epsilon^{-8})$ |
In this paper, we focus on the general CSO problem where multiple samples from the conditional distribution are available, and the objective is not necessarily in the compositional form of a convex loss $f_{\xi}$ and a linear mapping $g_{\eta}$. A closely related work is Hu et al. (2019), in which the authors study the generalization error bound and sample complexity of empirical risk minimization (ERM), a.k.a. sample average approximation (SAA), for general CSO. In contrast, here we aim at developing efficient stochastic gradient-based methods that directly solve the CSO problem (1), in both convex and nonconvex settings.
Due to the composition structure in the CSO problem (1), constructing unbiased gradient estimators with proper variance is not possible in general. Instead, we introduce a simple biased stochastic gradient descent (BSGD) method that leverages a mini-batch of conditional samples to construct a gradient estimator with controllable bias. We establish the sample complexities of BSGD for strongly convex, convex, and weakly convex objectives. For convex problems, this refers to the sample complexity required to find an $\epsilon$-optimal solution in expectation; for nonconvex problems, this refers to the sample complexity required to find an $\epsilon$-stationary point in expectation. Some of our results are summarized in Table 1 with a comparison to the sample complexity of SAA (or ERM) established in Hu et al. (2019).
1.1 Our Contributions
Our contributions are twofold:
First, we show that for general convex CSO problems, BSGD attains the same total sample complexity as empirical risk minimization, and the sample complexity significantly reduces when either smoothness or strong convexity holds. More specifically, for convex and Lipschitz continuous objectives, BSGD requires at most $O(\epsilon^{-4})$ total samples to find an $\epsilon$-optimal solution. If further the outer random function $f_{\xi}$ is smooth or the objective is strongly convex, the sample complexity reduces from $O(\epsilon^{-4})$ to $O(\epsilon^{-3})$. This improves to $\tilde{O}(\epsilon^{-2})$ if both conditions hold, where we use $\tilde{O}$ to represent the rate hiding logarithmic factors. In contrast to the SAA results in Hu et al. (2019), these sample complexities are independent of the problem's dimension. Furthermore, we show that, for weakly convex CSO problems (which are not necessarily smooth or convex), BSGD requires a total sample complexity of $O(\epsilon^{-8})$ to achieve an $\epsilon$-stationary point. When $f_{\xi}$ is smooth, the total sample complexity reduces to $O(\epsilon^{-6})$. To our best knowledge, these are the first sample complexity results of stochastic oracle-based algorithms for solving general CSO. In addition, for general convex CSO, we demonstrate that the complexity bounds of BSGD are indeed tight from an information-theoretic lower bound perspective.
Second, BSGD serves as a viable alternative for solving MAML and invariant learning, with theoretical guarantees. Particularly for MAML, BSGD converges to a stationary point under simple deterministic stepsize rules and appropriate inner mini-batch sizes. In contrast, the commonly used first-order MAML algorithm (Finn et al., 2017), which ignores the Hessian information, is not guaranteed to converge even for a large inner mini-batch size. For smooth MAML, our algorithm attains the same sample complexity as the recent algorithm introduced in Fallah et al. (2019); however, the latter requires the use of stochastic stepsizes and a mini-batch of outer samples at each iteration, which is less practical. Numerically, we further demonstrate that BSGD and its Adam variant achieve state-of-the-art performance with properly tuned hyperparameters for some MAML tasks. We also conduct numerical experiments for several other applications in invariant logistic regression and instrumental variable regression, and illustrate the delicate bias-variance tradeoff in practice between the inner mini-batch size and the number of iterations, which further supports our theoretical findings.
1.2 Related Work
Stochastic optimization. Traditional stochastic optimization, $\min_{x} \mathbb{E}_{\xi} f_{\xi}(x)$, is a special case of CSO. SGD and its numerous variants form one of the most important families of algorithms for solving stochastic optimization. For upper and lower bounds on stochastic gradient-based methods for convex and strongly convex objectives, see the seminal works and references therein (Nemirovsky and Yudin, 1985; Nemirovski et al., 2009). For weakly convex objectives, SGD achieves $O(\epsilon^{-4})$ stationary convergence (Ghadimi and Lan, 2013; Zhang and He, 2018; Davis and Drusvyatskiy, 2019). Recently, Drori and Shamir (2019) and Arjevani et al. (2019) showed a matching $\Omega(\epsilon^{-4})$ lower bound on stationary convergence for SGD and for unbiased stochastic first-order algorithms, respectively.
Stochastic Composition Optimization. This class of problems deals with objectives of the form

(3)  $\min_{x \in \mathcal{X}} \; \mathbb{E}_{\xi}\, f_{\xi}\big( \mathbb{E}_{\eta}\, g_{\eta}(x) \big),$

where $f_{\xi}: \mathbb{R}^k \to \mathbb{R}$, $g_{\eta}: \mathbb{R}^d \to \mathbb{R}^k$, and the inner randomness $\eta$ is independent of the outer randomness $\xi$. Several stochastic first-order algorithms have been developed to solve (3) as well as its corresponding ERM; see, e.g., Wang et al. (2016, 2017); Lian et al. (2017); Ghadimi et al. (2018); Zhang and Xiao (2019a, b, c), just to name a few. A key difference between general CSO and Problem (3) is that the latter can be formulated as a composition of two deterministic functions, while (1) cannot, because the inner conditional expectation in (1) depends on $\xi$. As a result, the aforementioned algorithms, which hinge upon this special structure, are not directly applicable to solving CSO.
1.3 Notations
Throughout the paper, we use $\Pi_{\mathcal{X}}$ to denote the projection operator, i.e., $\Pi_{\mathcal{X}}(y) = \operatorname{argmin}_{x \in \mathcal{X}} \|x - y\|_2^2$. We use $\tilde{O}$ to represent the rate hiding logarithmic factors. A function $f$ is $L$-Lipschitz continuous on $\mathcal{X}$ if $|f(x) - f(y)| \leq L \|x - y\|_2$ holds for any $x, y \in \mathcal{X}$. A function $f$ is $S$-Lipschitz smooth on $\mathcal{X}$ if $\|\nabla f(x) - \nabla f(y)\|_2 \leq S \|x - y\|_2$ holds for any $x, y \in \mathcal{X}$. We say a function $f$ is $\mu$-convex on $\mathcal{X}$ if for any $x, y \in \mathcal{X}$,

(4)  $f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \tfrac{\mu}{2} \|y - x\|_2^2.$

Note that $\mu > 0$, $\mu = 0$, and $\mu < 0$ correspond to $f$ being strongly convex, convex, and weakly convex, respectively.
2 Biased Stochastic Gradient Descent
For simplicity, throughout, we assume that the random functions $f_{\xi}(\cdot)$ and $g_{\eta}(\cdot, \xi)$ are continuously differentiable. Based on the special composition structure of CSO and the chain rule, the gradient of $F$ in (1) is given by

$\nabla F(x) = \mathbb{E}_{\xi} \Big[ \big( \mathbb{E}_{\eta \mid \xi} \nabla g_{\eta}(x, \xi) \big)^{\top} \nabla f_{\xi}\big( \mathbb{E}_{\eta \mid \xi}\, g_{\eta}(x, \xi) \big) \Big].$

In general, it is costly or even impossible to construct an unbiased stochastic estimator of this gradient and to apply SGD. Instead, we consider a biased estimator of $\nabla F(x)$ using one sample $\xi$ and $m$ i.i.d. samples $\{\eta_j\}_{j=1}^m$ from the conditional distribution of $\eta$ given $\xi$, in the following form:

$\hat{\nabla} F(x; \xi, \{\eta_j\}_{j=1}^m) = \Big( \frac{1}{m} \sum_{j=1}^{m} \nabla g_{\eta_j}(x, \xi) \Big)^{\top} \nabla f_{\xi}\Big( \frac{1}{m} \sum_{j=1}^{m} g_{\eta_j}(x, \xi) \Big).$

Note that $\hat{\nabla} F(x; \xi, \{\eta_j\}_{j=1}^m)$ is essentially the gradient of the empirical objective $\hat{F}(x; \xi, \{\eta_j\}_{j=1}^m) := f_{\xi}\big( \frac{1}{m} \sum_{j=1}^{m} g_{\eta_j}(x, \xi) \big)$. Based on this biased gradient estimator, we propose BSGD, which is formally described in Algorithm 1. When using fixed inner mini-batch sizes $m_t \equiv m$, BSGD can be treated as classical SGD on the approximation function $\mathbb{E}_{\xi, \{\eta_j\}} \hat{F}(x; \xi, \{\eta_j\}_{j=1}^m)$. Throughout the paper, we make the following assumptions:
Assumption 1. We assume that

(a) the domain $\mathcal{X} \subseteq \mathbb{R}^d$ is closed and convex;

(b) there exists $M > 0$ such that $\mathbb{E}_{\xi, \{\eta_j\}} \|\hat{\nabla} F(x; \xi, \{\eta_j\}_{j=1}^m)\|_2^2 \leq M^2$ for any $x \in \mathcal{X}$ and any $m$;

(c) there exists $\sigma_g > 0$ such that $\mathbb{E}_{\eta \mid \xi} \| g_{\eta}(x, \xi) - \mathbb{E}_{\eta \mid \xi}\, g_{\eta}(x, \xi) \|_2^2 \leq \sigma_g^2$ for any $x \in \mathcal{X}$ and $\xi$;

(d) $f_{\xi}(\cdot)$ is $L_f$-Lipschitz continuous on $\mathbb{R}^k$ for any $\xi$.

These assumptions are widely used in the convergence analysis of stochastic gradient-based algorithms, where (b) corresponds to the assumption on the bounded second moment of the gradient estimator. We use the following assumption to characterize the (non)convexity of objectives.

Assumption 2. $F(x)$ and $\hat{F}(x; \xi, \{\eta_j\}_{j=1}^m)$ are $\mu$-convex, i.e., satisfy (4), for any $\xi$ and $\{\eta_j\}_{j=1}^m$.
Strong convexity, namely $\mu > 0$, can be achieved by adding regularization to convex objectives. At the other end of the spectrum, weak convexity, namely $\mu < 0$, is a commonly used assumption in the nonconvex optimization literature (Paquette et al., 2017; Chen et al., 2019; Davis and Drusvyatskiy, 2019). It is satisfied by a variety of objectives used in machine learning, including some of those involving neural networks.
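As a concrete illustration of the update just described, the following is a minimal BSGD loop on an assumed toy CSO instance; the toy distributions and constants are our own choices. For this particular quadratic outer function the plug-in estimator happens to be unbiased, while for a nonlinear $\nabla f_{\xi}$ it is biased, as the next section quantifies.

```python
import numpy as np

def bsgd(grad_f, grad_g, g, sample_xi, sample_eta, x0, step, T, m):
    """Biased SGD for F(x) = E_xi f_xi(E_{eta|xi} g_eta(x, xi)).

    Each iteration draws one xi and m conditional etas, then forms the
    plug-in gradient (mean of grad_g)^T grad_f(mean of g)."""
    x = x0
    iterates = [x]
    for _ in range(T):
        xi = sample_xi()
        etas = [sample_eta(xi) for _ in range(m)]
        y_bar = np.mean([g(x, xi, e) for e in etas], axis=0)
        jac_bar = np.mean([grad_g(x, xi, e) for e in etas], axis=0)
        x = x - step * jac_bar * grad_f(y_bar, xi)
        iterates.append(x)
    return np.mean(iterates[T // 2:])   # tail averaging

# Toy CSO: g_eta(x, xi) = x + eta with eta | xi ~ N(xi, 1), f_xi(y) = y^2,
# xi ~ N(1, 1), so F(x) = (x + 1)^2 + const and the minimizer is x* = -1.
rng = np.random.default_rng(0)
x_out = bsgd(
    grad_f=lambda y, xi: 2.0 * y,
    grad_g=lambda x, xi, e: 1.0,
    g=lambda x, xi, e: x + e,
    sample_xi=lambda: rng.normal(1.0, 1.0),
    sample_eta=lambda xi: rng.normal(xi, 1.0),
    x0=5.0, step=0.05, T=2000, m=5)
```

With these settings the tail-averaged iterate lands close to the minimizer $x^* = -1$.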
Before presenting the main results, we make another observation: the bias of the function value estimator $\hat{F}(x; \xi, \{\eta_j\}_{j=1}^m)$, induced by the composition of functions, depends on the smoothness of the outer function $f_{\xi}$. Notice that for $S$-Lipschitz smooth $f_{\xi}$, we have

$\big| \mathbb{E} f_{\xi}(\hat{y}) - f_{\xi}(\mathbb{E} \hat{y}) \big| \leq \tfrac{S}{2}\, \mathbb{E} \|\hat{y} - \mathbb{E} \hat{y}\|_2^2,$

and for $L_f$-Lipschitz continuous $f_{\xi}$, we have

$\big| \mathbb{E} f_{\xi}(\hat{y}) - f_{\xi}(\mathbb{E} \hat{y}) \big| \leq L_f\, \mathbb{E} \|\hat{y} - \mathbb{E} \hat{y}\|_2,$

where $\hat{y}$ is a random variable. The following lemma formally characterizes the estimation bias of $\hat{F}$.
Lemma 2 (Hu et al. (2019)). Under Assumption 1, for a sample $\xi$ and $m$ i.i.d. samples $\{\eta_j\}_{j=1}^m$ from the conditional distribution of $\eta$ given $\xi$, and any $x$ that is independent of $\xi$ and $\{\eta_j\}_{j=1}^m$, we have

$\Big| \mathbb{E}_{\xi, \{\eta_j\}_{j=1}^m} \hat{F}(x; \xi, \{\eta_j\}_{j=1}^m) - F(x) \Big| \leq \frac{L_f \sigma_g}{\sqrt{m}}.$

If additionally $f_{\xi}$ is $S$-Lipschitz smooth, we have

$\Big| \mathbb{E}_{\xi, \{\eta_j\}_{j=1}^m} \hat{F}(x; \xi, \{\eta_j\}_{j=1}^m) - F(x) \Big| \leq \frac{S \sigma_g^2}{2m}.$

Lemma 2 implies that to control the bias of the estimator up to $\epsilon$, $m = O(\epsilon^{-2})$ samples are needed for Lipschitz continuous $f_{\xi}$, whereas only $m = O(\epsilon^{-1})$ samples are needed for Lipschitz smooth $f_{\xi}$.
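Lemma 2 is easy to probe numerically. The Monte Carlo sketch below uses our own toy setup ($\eta \sim N(0,1)$, $g$ the identity, so the inner mean is $\bar{\eta}_m$) and contrasts a smooth outer function, whose plug-in bias decays like $1/m$, with a nonsmooth Lipschitz one, whose bias decays like $1/\sqrt{m}$.

```python
import numpy as np

def plugin_bias(f, m, n_trials=100000):
    """Monte Carlo estimate of | E f(eta_bar_m) - f(E eta) | for eta ~ N(0, 1)."""
    rng = np.random.default_rng(0)
    eta_bar = rng.normal(0.0, 1.0, size=(n_trials, m)).mean(axis=1)
    return abs(np.mean(f(eta_bar)) - f(0.0))

# Smooth outer f(y) = y^2: bias = Var(eta_bar) = 1/m, i.e. O(1/m).
b_sq_10 = plugin_bias(np.square, 10)
b_sq_40 = plugin_bias(np.square, 40)

# Nonsmooth Lipschitz f(y) = |y|: bias = E|eta_bar| = sqrt(2/(pi*m)), i.e. O(1/sqrt(m)).
b_abs_10 = plugin_bias(np.abs, 10)
b_abs_40 = plugin_bias(np.abs, 40)
```

Quadrupling $m$ shrinks the smooth-case bias by roughly a factor of four but the nonsmooth-case bias only by roughly a factor of two, matching the two rates in Lemma 2.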
3 Convergence of BSGD
In this section, we begin by analyzing the global convergence of BSGD for strongly convex and convex objectives. Then, for the general case of weakly convex objectives, we analyze the convergence to an approximate stationary point. All of these results can be extended to the mini-batch setting in which BSGD uses multiple i.i.d. samples of $\xi$ instead of one at each iteration. All proofs are deferred to Appendix A.
3.1 Convergence for Strongly Convex Objectives
For strongly convex objectives $F$, i.e., functions satisfying Assumption 2 with $\mu > 0$, we have the following result.
Theorem 3.1. Under Assumptions 1 and 2 with $\mu > 0$, with stepsizes $\gamma_t = \Theta\big( \tfrac{1}{\mu t} \big)$ and fixed inner mini-batch sizes $m_t \equiv m$, the output $\hat{x}_T$ of BSGD satisfies

$\mathbb{E} F(\hat{x}_T) - \min_{x \in \mathcal{X}} F(x) \leq O\Big( \frac{M^2}{\mu T} + \frac{L_f \sigma_g}{\sqrt{m}} \Big).$

If additionally $f_{\xi}$ is $S$-Lipschitz smooth, we have

$\mathbb{E} F(\hat{x}_T) - \min_{x \in \mathcal{X}} F(x) \leq O\Big( \frac{M^2}{\mu T} + \frac{S \sigma_g^2}{m} \Big).$
Theorem 3.1 implies that to find an $\epsilon$-optimal solution, the number of iterations is of order $O(\epsilon^{-1})$, which agrees with the performance of SGD with the averaging scheme for strongly convex objectives (Shamir and Zhang, 2013). As for sample complexity, one could either use fixed mini-batch sizes $m_t \equiv m = O(\epsilon^{-2})$ (or $O(\epsilon^{-1})$ for smooth $f_{\xi}$), or time-varying mini-batch sizes $m_t = O(t^2)$ for Lipschitz continuous $f_{\xi}$ and $m_t = O(t)$ for Lipschitz smooth $f_{\xi}$, to establish the following result.
Corollary 3.1. Under the same assumptions as Theorem 3.1, to achieve an $\epsilon$-optimal solution, the total sample complexity of BSGD is $O(\epsilon^{-3})$. If further assuming Lipschitz smooth $f_{\xi}$, the sample complexity of BSGD is $\tilde{O}(\epsilon^{-2})$.

The above result indicates that smoothness of $f_{\xi}$ makes a difference in the total sample complexity of BSGD when solving CSO. It is worth pointing out that the sample complexity of BSGD matches that of ERM for strongly convex objectives established in Hu et al. (2019).
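A back-of-envelope tally of the time-varying schedules behind this corollary; the concrete $\epsilon$ and the unit constants are illustrative only.

```python
# Strongly convex case: T ~ 1/eps iterations.
# Lipschitz f:  inner batch m_t ~ t^2 keeps the bias term L_f*sigma_g/sqrt(m_t) ~ 1/t.
# Smooth f:     inner batch m_t ~ t   keeps the bias term S*sigma_g^2/(2*m_t)  ~ 1/t.
eps = 1e-2
T = int(1 / eps)
total_lipschitz = sum(t * t for t in range(1, T + 1))  # ~ T^3/3, i.e. O(eps^-3)
total_smooth = sum(t for t in range(1, T + 1))         # ~ T^2/2, i.e. O(eps^-2)
```

For $\epsilon = 10^{-2}$ this gives 338,350 versus 5,050 total inner samples, reproducing the $O(\epsilon^{-3})$ versus $\tilde{O}(\epsilon^{-2})$ gap between the nonsmooth and smooth cases.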
3.2 Convergence for General Convex Objectives
We now turn to the convergence for general convex objectives, i.e., those satisfying Assumption 2 with $\mu = 0$. Denote $D := \min_{x^* \in \mathcal{X}^*} \|x_1 - x^*\|_2$, where $\mathcal{X}^*$ refers to the optimal solution set of (1) over $\mathcal{X}$ and $x_1$ is the initial point used in BSGD.
Theorem 3.2. Under Assumptions 1 and 2 with $\mu = 0$, with stepsizes $\gamma_t = c/\sqrt{T}$ for a positive constant $c$ and fixed inner mini-batch sizes $m_t \equiv m$, the output $\hat{x}_T$ of BSGD satisfies

$\mathbb{E} F(\hat{x}_T) - \min_{x \in \mathcal{X}} F(x) \leq O\Big( \frac{D^2/c + c M^2}{\sqrt{T}} + \frac{L_f \sigma_g}{\sqrt{m}} \Big).$

Furthermore, if $f_{\xi}$ is $S$-Lipschitz smooth, we have

$\mathbb{E} F(\hat{x}_T) - \min_{x \in \mathcal{X}} F(x) \leq O\Big( \frac{D^2/c + c M^2}{\sqrt{T}} + \frac{S \sigma_g^2}{m} \Big).$
Compared with Theorem 3.1, without the strong convexity condition, the iteration complexity increases from $O(\epsilon^{-1})$ to $O(\epsilon^{-2})$. Similarly, by setting fixed mini-batch sizes $m = O(\epsilon^{-2})$ (or $O(\epsilon^{-1})$ for smooth $f_{\xi}$), or time-varying sizes $m_t = O(t)$ for Lipschitz continuous $f_{\xi}$ and $m_t = O(\sqrt{t})$ for Lipschitz smooth $f_{\xi}$, we can obtain the following result.
Corollary 3.2. Under the same assumptions as Theorem 3.2, to achieve an $\epsilon$-optimal solution, the total sample complexity required by BSGD is $O(\epsilon^{-4})$. If further assuming Lipschitz smooth $f_{\xi}$, the sample complexity required by BSGD becomes $O(\epsilon^{-3})$.
3.3 Stationary Convergence for Nonconvex Objectives
In this subsection, we analyze the convergence of BSGD for finding an approximate stationary point when the objective $F$ is weakly convex, i.e., satisfies Assumption 2 with $\mu < 0$; we write $\rho := -\mu > 0$. We first define the Moreau envelope $F_{\lambda}$ (with $0 < \lambda < \rho^{-1}$) of the function $F$ and its corresponding minimizer:

$F_{\lambda}(x) := \min_{z \in \mathcal{X}} \Big\{ F(z) + \frac{1}{2\lambda} \|z - x\|_2^2 \Big\}, \qquad \operatorname{prox}_{\lambda F}(x) := \operatorname{argmin}_{z \in \mathcal{X}} \Big\{ F(z) + \frac{1}{2\lambda} \|z - x\|_2^2 \Big\}.$

Based on the Moreau envelope, we define the gradient mapping

$\nabla F_{\lambda}(x) = \lambda^{-1} \big( x - \operatorname{prox}_{\lambda F}(x) \big).$

We call $\bar{x}$, generated by a randomized algorithm, an $\epsilon$-stationary point of $F$ if $\mathbb{E} \|\nabla F_{1/(2\rho)}(\bar{x})\| \leq \epsilon$, where the expectation is taken with respect to the randomness in $\bar{x}$. This convergence criterion is commonly used in the nonconvex optimization literature (Beck, 2017; Drusvyatskiy, 2017). We have the following result.
Theorem 3.3. Under Assumptions 1 and 2 with $\mu < 0$, for a given total iteration number $T$, let stepsizes $\gamma_t = c/\sqrt{T}$ for a positive constant $c$ and inner mini-batch sizes $m_t \equiv m$ for some integer $m$. The output $\bar{x}_T$ of BSGD, selected uniformly at random from $\{x_1, \ldots, x_T\}$, satisfies

$\mathbb{E} \|\nabla F_{1/(2\rho)}(\bar{x}_T)\|^2 \leq O\Big( \frac{1}{\sqrt{T}} + \frac{L_f \sigma_g}{\sqrt{m}} \Big).$

If we further assume $f_{\xi}$ is $S$-Lipschitz smooth, we have

$\mathbb{E} \|\nabla F_{1/(2\rho)}(\bar{x}_T)\|^2 \leq O\Big( \frac{1}{\sqrt{T}} + \frac{S \sigma_g^2}{m} \Big).$
To the best of our knowledge, this is the first non-asymptotic convergence guarantee for CSO in the nonconvex setting.

Corollary 3.3. Under the same assumptions as Theorem 3.3, to achieve an $\epsilon$-stationary point, the total sample complexity required by BSGD is at most $O(\epsilon^{-8})$. If further assuming Lipschitz smooth $f_{\xi}$, the sample complexity is at most $O(\epsilon^{-6})$.

Specifically, the total sample complexity is obtained by setting the number of iterations $T = O(\epsilon^{-4})$ and the inner batch size $m = O(\epsilon^{-4})$, or $m = O(\epsilon^{-2})$ in the smooth case. Note that even when we use an infinite number of i.i.d. samples from the conditional distribution, namely $m = \infty$, to construct an unbiased gradient estimator at each iteration, BSGD (which then reduces to SGD) would still need $O(\epsilon^{-4})$ iterations to obtain an $\epsilon$-stationary point in general. This is also known to be the optimal sample complexity of finding an $\epsilon$-stationary point using unbiased first-order algorithms for weakly convex objectives (Arjevani et al., 2019; Drori and Shamir, 2019). Thus, the iteration complexity of BSGD is optimal.
We also provide a convergence guarantee using decaying stepsizes.

Corollary (Decaying Stepsizes). Let $T$ be the total iteration number, inner batch sizes $m_t \equiv m$, and stepsizes $\gamma_t = c/\sqrt{t}$ with a positive constant $c$. If the output $\bar{x}$ is chosen uniformly from $\{x_{\lceil T/2 \rceil}, \ldots, x_T\}$, we have

(5)  $\mathbb{E} \|\nabla F_{1/(2\rho)}(\bar{x})\|^2 \leq O\Big( \frac{\log T}{\sqrt{T}} + \frac{L_f \sigma_g}{\sqrt{m}} \Big);$

if we further assume $f_{\xi}$ is $S$-Lipschitz smooth, we have

(6)  $\mathbb{E} \|\nabla F_{1/(2\rho)}(\bar{x})\|^2 \leq O\Big( \frac{\log T}{\sqrt{T}} + \frac{S \sigma_g^2}{m} \Big).$
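The stationarity measure used throughout this subsection can be computed numerically for simple functions. The sketch below evaluates the proximal point by grid search for an assumed toy weakly convex function $F(z) = |z^2 - 1|$ (weak-convexity modulus $\rho = 2$), showing that the gradient-mapping norm is small near the stationary point $z = 1$ and large away from it.

```python
import numpy as np

def moreau_prox(F, x, lam, grid):
    """prox_{lam F}(x): grid minimizer of F(z) + ||z - x||^2 / (2 lam)."""
    vals = F(grid) + (grid - x) ** 2 / (2.0 * lam)
    return grid[np.argmin(vals)]

def grad_mapping_norm(F, x, lam, grid):
    """|grad F_lam(x)| = |x - prox_{lam F}(x)| / lam (Moreau envelope gradient)."""
    return abs(x - moreau_prox(F, x, lam, grid)) / lam

F = lambda z: np.abs(z ** 2 - 1.0)     # weakly convex; stationary points at 0, +/-1
grid = np.linspace(-3.0, 3.0, 600001)  # 1e-5 spacing, plenty for this illustration
lam = 0.1                              # any lam < 1/rho with rho = 2 works here

g_near = grad_mapping_norm(F, 1.02, lam, grid)  # near the minimizer z = 1
g_far = grad_mapping_norm(F, 2.0, lam, grid)    # far from any stationary point
```

At $x = 2$ the proximal point solves $12z = 20$, i.e. $z = 5/3$, so the mapping norm is $(2 - 5/3)/0.1 \approx 3.33$; at $x = 1.02$ the proximal point sits at the kink $z = 1$, giving a norm of only $0.2$.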
4 Lower Bounds of BSGD in Convex Setting
In this section, we discuss lower bounds on the expected error of BSGD using the well-known oracle model. The oracle model consists of three components: a function class of interest $\mathcal{F}$, an algorithm class $\mathcal{A}$, and an oracle class $\Phi$, in which an oracle $\phi$ takes a query point $x$ from an algorithm about a function $h$ and reveals back to the algorithm some information about $h$. The function class of interest, the algorithm class, and the detailed proofs are deferred to Section B in the Appendix.
Definition (Biased Stochastic Function Value Oracle). A biased stochastic function value oracle $\phi$ has two parameters $\sigma^2$ and $\bar{\Delta}$. For a query at point $x$ of a function $h$ given by an algorithm, $\phi$ takes a sample $\xi$ from a distribution $\mathcal{P}$, and returns to the algorithm $(\hat{h}(x; \xi), \hat{g}(x; \xi))$, which satisfies the following conditions:

(i) $\big| \mathbb{E}_{\xi}\, \hat{h}(x; \xi) - h(x) \big| \leq \bar{\Delta}$;

(ii) $\mathrm{Var}\big( \hat{h}(x; \xi) \big) \leq \sigma^2$;

(iii) $\hat{g}(x; \xi)$ is an unbiased estimator of a subgradient of $\mathbb{E}_{\xi}\, \hat{h}(\cdot; \xi)$ at $x$ with variance bounded by $\sigma^2$.
We use the minimax error as the complexity measure:

(7)  $\Delta_T(\mathcal{F}, \mathcal{A}, \Phi) := \inf_{A \in \mathcal{A}} \sup_{h \in \mathcal{F},\, \phi \in \Phi} \mathbb{E}\big[ h(x_T^A) - \min_{x \in \mathcal{X}} h(x) \big],$

where $h(x_T^A) - \min_{x \in \mathcal{X}} h(x)$ is the optimization error of a randomized algorithm $A$ for optimizing a function $h$ on a bounded convex set $\mathcal{X}$ via querying the oracle $\phi$ a total of $T$ times. If $\Delta_T(\mathcal{F}, \mathcal{A}, \Phi) \geq c$, it implies that for any algorithm $A \in \mathcal{A}$, there exists a 'hard' function $h$ in $\mathcal{F}$ and an oracle $\phi$ such that the expected error incurred by $A$ when optimizing $h$ querying $\phi$ after $T$ iterations is at least $c$.
Theorem 4. For any algorithm $A$ that queries the biased stochastic function value oracle $T$ times, the minimax error satisfies the following bounds:

(i) for the one-dimensional convex and Lipschitz continuous function class, we have

(8)  $\Delta_T \geq \Omega\Big( \frac{\sigma}{\sqrt{T}} + \bar{\Delta} \Big);$

(ii) for the one-dimensional strongly convex and Lipschitz continuous function class, we have

(9)  $\Delta_T \geq \Omega\Big( \frac{\sigma^2}{T} + \bar{\Delta}^2 \Big).$
A general version of Theorem 4 for arbitrary dimension is provided in Appendix B. In the special case when $\bar{\Delta} = 0$, namely when the oracle returns to the algorithm an unbiased function estimator and its gradient, Theorem 4 reduces to the classical lower bounds on oracle complexity for stochastic convex optimization. The proof follows ideas from Hu et al. (2016) and is deferred to Appendix B. Invoking Lemma 2 and Theorem 4, we obtain the following lower bounds for BSGD.
Corollary 4. When minimizing a CSO objective using BSGD with a fixed inner mini-batch size $m$ and any stepsizes for $T$ iterations, there always exist samples such that:

(i) $\mathbb{E} F(x_T) - \min_{x \in \mathcal{X}} F(x) \geq \Omega\big( \tfrac{1}{\sqrt{T}} + \tfrac{\sigma_g}{\sqrt{m}} \big)$ for convex $F$ and Lipschitz continuous $f_{\xi}$;

(ii) $\mathbb{E} F(x_T) - \min_{x \in \mathcal{X}} F(x) \geq \Omega\big( \tfrac{1}{\sqrt{T}} + \tfrac{\sigma_g^2}{m} \big)$ for convex $F$ and Lipschitz smooth $f_{\xi}$;

(iii) $\mathbb{E} F(x_T) - \min_{x \in \mathcal{X}} F(x) \geq \Omega\big( \tfrac{1}{T} + \tfrac{\sigma_g^2}{m} \big)$ for 1-strongly convex $F$ and Lipschitz continuous $f_{\xi}$;

(iv) $\mathbb{E} F(x_T) - \min_{x \in \mathcal{X}} F(x) \geq \Omega\big( \tfrac{1}{T} + \tfrac{\sigma_g^4}{m^2} \big)$ for 1-strongly convex $F$ and Lipschitz smooth $f_{\xi}$.
Compared with Corollary 3.2, Corollary 4 implies that the sample complexity of BSGD for convex objectives is optimal and cannot be further improved. For strongly convex objectives, there exists a gap between the lower bound and the upper bound. It remains interesting to explore whether the lower bounds can be further improved, which we leave for future investigation.
5 Applications and Numerical Experiments
In this section, we demonstrate the numerical performance of BSGD on invariant logistic regression, model-agnostic meta-learning, and instrumental variable regression. The parameter settings and detailed experiment results are deferred to Appendix C. The platform used for the experiments is an Intel Core i9-7940X CPU @ 3.10GHz with 32GB RAM, running 64-bit Ubuntu 18.04.3 LTS.
5.1 Invariant Logistic Regression
Invariant learning has wide applications in machine learning and related areas (Mroueh et al., 2015; Anselmi et al., 2016). We consider a simple invariant logistic regression problem with the following formulation:

(10)  $\min_{x \in \mathbb{R}^d} \; \mathbb{E}_{(a, b)} \log\Big( 1 + \exp\big( -b \cdot \mathbb{E}_{\bar{a} \mid a}[\bar{a}]^{\top} x \big) \Big),$

where $x \in \mathbb{R}^d$ is the model parameter to be trained, $a$ is the random feature vector, $b \in \{\pm 1\}$ is the corresponding label, and $\bar{a}$ is a randomly perturbed observation of the feature $a$.
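A BSGD gradient for this objective only needs the inner mini-batch average of perturbed features. The snippet below is a self-contained toy run; the data model, perturbation scale, and stepsize are our own choices rather than the experimental settings.

```python
import numpy as np

def bsgd_grad_invariant_logreg(x, a, b, sample_perturbed, m):
    """Plug-in BSGD gradient of log(1 + exp(-b * <eta_bar, x>)), where eta_bar
    averages m perturbed copies of the raw feature a (the inner expectation)."""
    eta_bar = np.mean([sample_perturbed(a) for _ in range(m)], axis=0)
    z = -b * (eta_bar @ x)
    return (1.0 / (1.0 + np.exp(-z))) * (-b) * eta_bar   # sigmoid(z) * dz/dx

rng = np.random.default_rng(0)
d, lr, m = 5, 0.5, 10
x_true = rng.normal(size=d)
x, tail = np.zeros(d), []
for t in range(3000):
    a = rng.normal(size=d)                          # clean feature
    b = 1.0 if a @ x_true > 0 else -1.0             # label from the clean feature
    perturb = lambda feat: feat + 0.5 * rng.normal(size=d)
    x = x - lr * bsgd_grad_invariant_logreg(x, a, b, perturb, m)
    if t >= 1500:
        tail.append(x)
x_avg = np.mean(tail, axis=0)
cos = x_avg @ x_true / (np.linalg.norm(x_avg) * np.linalg.norm(x_true))
```

With the inner average in place, the learned direction aligns with the ground-truth parameter despite the feature noise.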
We generate a synthetic dataset of feature-label pairs and consider three different variances of the perturbation, corresponding to different perturbation levels. At each iteration, we use a fixed inner mini-batch size $m$, namely $m$ samples of $\bar{a}$ are generated for a given feature-label pair $(a, b)$. We fine-tune the stepsizes for BSGD using grid search. We compare the performance to the SAA approach (Hu et al., 2019) solved by CVXPY as a benchmark.
For a given total number of samples, the performance of BSGD with different inner batch sizes $m$ and under different perturbation levels is summarized in Figure 1. When the perturbation level increases, larger inner batch sizes are needed to control the bias incurred by the biased gradient estimator of BSGD. This numerical observation supports our theoretical findings on the delicate trade-off between the inner batch size and the number of iterations. We also compare the performance achieved by BSGD and SAA in terms of the final optimality gap, summarized in Table 2.

Table 2: Comparison of BSGD and SAA under different perturbation levels (mean and deviation of the optimality gap; $m$ is the inner batch size).

| Perturbation level | BSGD $m$ | BSGD Mean | BSGD Dev | SAA $m$ | SAA Mean | SAA Dev |
| 1 | 5 | 1.77e-04 | 4.70e-05 | 100 | 5.56e-04 | 2.81e-04 |
| 10 | 5 | 3.26e-04 | 1.15e-04 | 464 | 2.14e-03 | 8.45e-04 |
| 100 | 50 | 1.50e-03 | 6.97e-04 | 1000 | 1.12e-02 | 6.42e-04 |
5.2 ModelAgnostic MetaLearning (MAML)
MAML has received tremendous popularity in the past years for few-shot supervised learning and meta reinforcement learning. Recall the formulation of MAML in (2). It is a special case of CSO (1) with the mappings $\xi = (i, D^{i}_{\mathrm{query}})$, $\eta = D^{i}_{\mathrm{support}}$, $g_{\eta}(w, \xi) = w - \alpha \nabla l_i(w, D^{i}_{\mathrm{support}})$, and $f_{\xi}(\cdot) = l_i(\cdot, D^{i}_{\mathrm{query}})$.

Remark (Convergence of BSGD for MAML). If the objective of MAML in (2) is weakly convex and $f_{\xi}$ is Lipschitz smooth in $w$ for any $\xi$, namely $\nabla^2 l_i$ is Lipschitz continuous and $l_i$ is Lipschitz smooth, then Corollary 3.3 implies that BSGD converges to an $\epsilon$-stationary point of (2) with sample complexity $O(\epsilon^{-6})$. This complexity, as we have also mentioned in the introduction, matches the sample complexity of a recent algorithm for MAML discussed in Fallah et al. (2019), which additionally requires stochastic stepsizes and a mini-batch of tasks (i.e., outer samples) at each iteration.
We take the widely used sine-wave few-shot regression task as an illustration. The goal is to train a model that can recover a new (unseen) sine-wave signal from only a few available data points. The sine wave is of the form $y = A \sin(x + \varphi)$, where the amplitude $A$ and phase $\varphi$ are undetermined parameters sampled per task, and $x$ lies in a bounded region. In this experiment, the model is a fully connected neural network with ReLU activations between the hidden layers.
For a fixed total number of samples, we select the inner batch size $m \in \{10, 20, 50\}$ and compare the performance of BSGD with first-order MAML (FOMAML) (Finn et al., 2017; Fallah et al., 2019), where FOMAML ignores the Hessian information when constructing the gradient estimator. Here we use the objective value as the measurement. Since the objective is analytically intractable, we evaluate the MAML objective via the empirical objective obtained by replacing each expectation in (2) with a sample average:

(11)  $\hat{F}(w) = \frac{1}{n_1} \sum_{i=1}^{n_1} \frac{1}{n_3} \sum_{k=1}^{n_3} l\Big( w - \alpha\, \frac{1}{n_2} \sum_{j=1}^{n_2} \nabla l\big(w; s^i_j\big);\; q^i_k \Big),$

where $s^i_j$ and $q^i_k$ denote support and query samples of task $i$, and the three sample sizes $n_1$, $n_2$, and $n_3$ are set to be 100; the sample tasks/data are selected randomly when computing the approximate loss value. We fine-tune the stepsizes for both BSGD and FOMAML. We also compare the performance with Adam, which is applied in the original MAML paper (Finn et al., 2017); for Adam, we use its default parameter settings in PyTorch for training.
Experiment results are summarized in Figure 2. The average loss and running time (in CPU minutes) of each algorithm (under its best inner batch size), averaged over repeated trials, are given in Table 3. More detailed experiment results are provided in Table 6 in Section C.2 of the Appendix. Figure 2(a) demonstrates the trade-off between the inner batch size and the number of iterations for BSGD. Figure 2(b) compares the convergence of the three algorithms with the best-tuned inner batch sizes, and the red line demonstrates a divergent case of FOMAML for a large inner batch size. Although FOMAML requires the least running time, as shown in Table 3, its performance is worse than that of BSGD. BSGD achieves the least error among the three methods with a proper inner batch size. BSGD is slightly faster than Adam and requires a smaller mini-batch size to achieve its best performance, which is more practical when some tasks only have a small number of samples. We also use the meta-initialization parameters obtained by the different methods as the initial model parameters to train an unseen sine-wave regression task. Figure 2(c) shows the signals recovered after a one-step update on the unseen task with only a few samples. Here, Random NN refers to a neural network with the same structure using a random initialization. We observe that BSGD, FOMAML, and Adam recover most of the shape of the signal, while Random NN fails to work without the meta-initialization parameter.
Table 3: Average loss (Mean) and running time in CPU minutes of BSGD, FOMAML, and Adam for different inner batch sizes $m$.

| $m$ | BSGD Mean | BSGD CPU | FOMAML Mean | FOMAML CPU | Adam Mean | Adam CPU |
| 10 | 2.12e-01 | 71.57 | 2.52e-01 | 41.45 | 8.16e-01 | 86.54 |
| 20 | 2.04e-01 | 35.63 | 2.50e+00 | 20.60 | 3.99e-01 | 43.42 |
| 50 | 2.17e-01 | 14.63 | 3.98e+00 | 8.64 | 2.77e-01 | 17.62 |
5.3 Instrumental Variable Regression (IV)
IV regression is a fundamental task in econometrics, health care, and social science. It aims to estimate the causal effect of an input $X$ on a confounded outcome $Y$ with an instrument variable $Z$ that, for given $X$, is conditionally independent of $Y$, namely $Y \perp Z \mid (X, \varepsilon)$. Since an unobservable confounder $\varepsilon$ influences both $X$ and $Y$, classical regression methods might fail. Recently, Muandet et al. (2019) showed that, instead of using the traditional two-stage procedure (Angrist and Pischke, 2008; Hartford et al., 2017), one could resort to solving an optimization problem formulated as

(12)  $\min_{\theta} \; \mathbb{E}_{Y, Z} \big( Y - \mathbb{E}_{X \mid Z}[ f_{\theta}(X) ] \big)^2,$

where $f_{\theta}$ is a function parameterized by $\theta$ that maps $X$ to $Y$, and the outer loss is the square loss.
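For a linear hypothesis class, the BSGD gradient of (12) is one line. The following toy run uses our own assumed data-generating process (a linear ground-truth effect `theta0` with an additive confounder), not the experimental setup below; note that the plug-in estimator retains a small $O(1/m)$ bias from squaring the inner mean.

```python
import numpy as np

def bsgd_grad_iv(theta, y, xs):
    """Plug-in gradient of (y - E[f_theta(X) | Z])^2 for f_theta(x) = theta * x,
    using m conditional samples xs drawn from P(X | Z = z)."""
    x_bar = np.mean(xs)
    return -2.0 * (y - theta * x_bar) * x_bar

theta0 = 1.5                      # assumed ground-truth causal effect
rng = np.random.default_rng(0)
theta, lr, m, tail = 0.0, 0.02, 20, []
for t in range(4000):
    z = rng.normal()              # instrument
    e = rng.normal()              # confounder: hits both X and Y
    y = theta0 * (z + e) + e      # outcome from the realized X = z + e
    xs = z + rng.normal(size=m)   # fresh conditional samples of X | Z = z
    theta -= lr * bsgd_grad_iv(theta, y, xs)
    if t >= 2000:
        tail.append(theta)
theta_avg = sum(tail) / len(tail)
```

Ordinary regression of $y$ on $x$ would be biased upward by the confounder, whereas the conditional-mean objective recovers a value close to `theta0`, up to the finite-$m$ plug-in bias.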
Similar to Lewis and Syrgkanis (2018); Bennett et al. (2019), we generate data via the following process:

$Y = g_0(X) + e + \delta, \qquad X = Z + e + \gamma,$

where $e$ is the confounder and $\delta, \gamma$ are random noise. We consider four cases with different ground-truth functions $g_0$: (a) a sine function, (b) a step function, (c) the absolute value function, and (d) a linear function. We compare the performance of BSGD, Random NN, the classical two-stage least squares (2SLS) (Angrist and Imbens, 1995), and Poly2SLS. Random NN predicts $Y$ directly from $X$ using a neural network. Poly2SLS expands $X$ and $Z$ via polynomial features and performs 2SLS via ridge regression. The neural networks applied in BSGD and Random NN have the same structure: fully connected hidden layers with ReLU activations between layers. For Random NN, we use classical SGD to perform updates.
Figure 4 demonstrates the convergence of BSGD under different inner mini-batch sizes (each case run five times) on data generated by the different ground-truth functions. Figure 3 compares the recovered signals using BSGD, Random NN, 2SLS, and Poly2SLS. For BSGD and Random NN, we fine-tune the stepsize, and we select the best inner mini-batch size for BSGD. For 2SLS and Poly2SLS, we use the implementation released in Bennett et al. (2019). The results show that 2SLS outperforms all other methods in the linear ground-truth case, since the underlying assumption of 2SLS is linearity. BSGD achieves the least mean squared error in the other three cases.
Table 4: Mean squared error (Mean) and standard deviation (Dev) of each method under the four ground-truth functions (one Mean/Dev pair per case).

| Method | Mean | Dev | Mean | Dev | Mean | Dev | Mean | Dev |
| BSGD | 6.21e-01 | 5.93e-01 | 5.54e-01 | 1.85e-01 | 8.78e-01 | 2.43e-02 | 1.06e-01 | 6.98e-02 |
| Random NN | 2.07e+00 | 2.68e-01 | 3.40e+00 | 3.61e-01 | 2.15e+00 | 8.83e-02 | 1.57e+00 | 4.16e+00 |
| 2SLS | 4.85e+00 | 1.24e-02 | 1.81e-04 | 1.63e-04 | 5.22e+00 | 3.44e-02 | 7.76e-01 | 2.18e-02 |
| Poly2SLS | 1.04e+01 | 3.89e-01 | 2.26e-04 | 3.17e-04 | 1.28e+01 | 4.26e+00 | 3.75e+01 | 4.26e+00 |
6 Conclusion
We propose a biased first-order algorithm, BSGD, for solving CSO problems. We analyze upper bounds on the sample complexity of BSGD under various structural assumptions, and establish the first non-asymptotic convergence result to an approximate stationary point for nonconvex CSO, which also provides a theoretical guarantee for MAML. We show that the sample complexity of BSGD depends heavily on the (non)convexity condition, which influences the iteration complexity, and on the smoothness condition, which affects the estimation bias incurred by the biased gradient estimator. We also provide a matching lower bound of BSGD for convex objectives. As for strongly convex and weakly convex objectives, we conjecture that BSGD achieves optimal sample complexities, whereas constructing matching lower bounds remains an interesting open problem.
Appendix A Convergence Analysis
In this section, we present the proofs of Theorems 3.1, 3.2, and 3.3. We also prove Corollary 3.1; the proofs of Corollaries 3.2 and 3.3 are similar. First, we present the proof framework for strongly convex and convex objectives.
A.1 Proof Framework for Strongly Convex and Convex Objectives
Recall that in Algorithm 1, at iteration $t$, BSGD first generates a sample $\xi_t$ from the distribution of $\xi$ and $m_t$ samples $\{\eta_{tj}\}_{j=1}^{m_t}$ from the conditional distribution of $\eta$ given $\xi_t$. We define the following auxiliary function to facilitate our analysis:

$\hat{F}_t(x) := f_{\xi_t}\Big( \frac{1}{m_t} \sum_{j=1}^{m_t} g_{\eta_{tj}}(x, \xi_t) \Big).$

Note that $\hat{F}_t(x) = \hat{F}(x; \xi_t, \{\eta_{tj}\}_{j=1}^{m_t})$, and the biased gradient estimator used in BSGD is $\nabla \hat{F}_t(x_t)$. Since $x_{t+1} = \Pi_{\mathcal{X}}\big( x_t - \gamma_t \nabla \hat{F}_t(x_t) \big)$ and the projection operator is non-expansive, we have for any $x \in \mathcal{X}$,

(13)  $\|x_{t+1} - x\|^2 \leq \|x_t - \gamma_t \nabla \hat{F}_t(x_t) - x\|^2 = \|x_t - x\|^2 - 2\gamma_t \langle \nabla \hat{F}_t(x_t), x_t - x \rangle + \gamma_t^2 \|\nabla \hat{F}_t(x_t)\|^2.$

Dividing by $2\gamma_t$ and taking expectation on both sides, using $\mathbb{E}\|\nabla \hat{F}_t(x_t)\|^2 \leq M^2$ from Assumption 1(b), we have

(14)  $\mathbb{E} \langle \nabla \hat{F}_t(x_t), x_t - x \rangle \leq \frac{\mathbb{E}\|x_t - x\|^2 - \mathbb{E}\|x_{t+1} - x\|^2}{2\gamma_t} + \frac{\gamma_t}{2} M^2.$

By Assumption 2, we have

(15)  $\hat{F}_t(x) \geq \hat{F}_t(x_t) + \langle \nabla \hat{F}_t(x_t), x - x_t \rangle + \frac{\mu}{2} \|x - x_t\|^2.$

Taking expectation on both sides of (15), rearranging, and combining with (14), it holds that

(16)  $\mathbb{E}\big[ \hat{F}_t(x_t) - \hat{F}_t(x) \big] + \frac{\mu}{2} \mathbb{E}\|x_t - x\|^2 \leq \frac{\mathbb{E}\|x_t - x\|^2 - \mathbb{E}\|x_{t+1} - x\|^2}{2\gamma_t} + \frac{\gamma_t}{2} M^2.$

We also notice that $|\mathbb{E} \hat{F}_t(x_t) - \mathbb{E} F(x_t)|$ and $|\mathbb{E} \hat{F}_t(x) - F(x)|$ can be bounded by Lemma 2, as $x_t$ and $x$ are independent of $\xi_t$ and $\{\eta_{tj}\}_{j=1}^{m_t}$.