Quadruply Stochastic Gradients for Large Scale Nonlinear
Semi-Supervised AUC Optimization
Abstract
Semi-supervised learning is pervasive in real-world applications, where only a few labeled data are available and large amounts of instances remain unlabeled. Since AUC is an important model evaluation metric in classification, directly optimizing AUC in the semi-supervised learning scenario has drawn much attention in the machine learning community. Recently, it has been shown that one could find an unbiased solution for the semi-supervised AUC maximization problem without knowing the class prior distribution. However, this method is hardly scalable to nonlinear classification problems with kernels. To address this problem, in this paper we propose a novel scalable quadruply stochastic gradient algorithm (QSG-S2AUC) for nonlinear semi-supervised AUC optimization. In each iteration of the stochastic optimization process, our method randomly samples a positive instance, a negative instance, an unlabeled instance and their random features to compute the gradient, and then updates the model by using this quadruply stochastic gradient to approach the optimal solution. More importantly, we prove that QSG-S2AUC converges to the optimal solution at the rate of $O(1/t)$, where $t$ is the iteration number. Extensive experimental results on a variety of benchmark datasets show that QSG-S2AUC is far more efficient than the existing state-of-the-art algorithms for semi-supervised AUC maximization, while retaining similar generalization performance.
Wanli Shi, Bin Gu, Xiang Li, Xiang Geng and Heng Huang† (†Contact Author)
School of Computer & Software, Nanjing University of Information Science & Technology, P.R. China
JD Finance America Corporation
Computer Science Department, University of Western Ontario, Canada
Department of Electrical & Computer Engineering, University of Pittsburgh, USA
wanlishi@nuist.edu.cn, jsgubin@gmail.com, lxiang2@uwo.ca, gengxiang@nuist.edu.cn, heng.huang@pitt.edu
1 Introduction
Semi-supervised learning addresses problems where the available data consist of a small number of labeled samples and a huge number of unlabeled samples. It is of immense practical interest in a wide range of applications, such as image retrieval [?], natural language processing [?] and speech analysis [?]. Since semi-supervised learning requires less human effort and can achieve better generalization performance, it has attracted a great deal of attention in the machine learning community, e.g., [?; ?; ?; ?; ?].
The area under the ROC curve (AUC) [?] measures the probability of a randomly drawn positive instance being ranked higher than a randomly drawn negative instance. Thus, AUC is a more effective performance measure than accuracy in imbalanced binary classification problems. Many studies [?; ?] have also pointed out that optimizing AUC can achieve better generalization performance than directly optimizing accuracy. Due to this superiority, introducing AUC into semi-supervised learning has attracted a large amount of attention.
Recently, several algorithms have been proposed to address the semi-supervised AUC optimization problem. For instance, to train a classifier, SSRankBoost [?] and OptAG [?] exploited the assumption that two samples share the same label if their distance in a metric space is small. However, this restrictive assumption may not always hold in real-world applications and could lead to biased solutions. Sakai et al. [?] pointed out that both unlabeled and labeled instances follow the same joint probability distribution, so the restrictive assumption is not necessary. However, their method PNU-AUC requires estimating the class prior, which is difficult to obtain when the number of labeled instances is extremely small. Recently, Xie and Li [?] showed that neither the class prior nor any other distributional assumption about the unlabeled data is necessary to find an unbiased solution. We summarize these algorithms in Table 1.
Table 1: Summary of representative semi-supervised AUC optimization algorithms ($n$ denotes the number of training instances, $d$ the data dimensionality, $D$ the number of random features and $t$ the number of iterations).

| Algorithm | Reference | Function model | Computational complexity | Space complexity |
|---|---|---|---|---|
| SSRankBoost | [?] | Nonlinear model | — | — |
| OptAG | [?] | Linear model | — | — |
| PNU-AUC | [?] | Nonlinear model | $O(n^3)$ | $O(n^2)$ |
| SAMULT | [?] | Nonlinear model | $O(n^3)$ | $O(n^2)$ |
| QSG-S2AUC | Ours | Nonlinear model | $O(Dd)$ per iteration | $O(tD)$ |
Nonlinear data structures widely exist in many real-world problems, and the kernel method is a typical way to solve such problems [?; ?]. However, this approach can hardly scale to large datasets. Specifically, the kernel matrix needs $O(n^2 d)$ operations to be calculated and $O(n^2)$ memory to be stored, where $n$ denotes the number of instances and $d$ denotes the dimensionality of the data [?]. These computational bottlenecks become even more severe for semi-supervised learning because the sample size is always very large in the semi-supervised scenario. Even worse, PNU-AUC and SAMULT [?] need $O(n^3)$ operations to compute the matrix inverse. Thus, scaling up nonlinear semi-supervised AUC maximization is a challenging problem.
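To make these orders of magnitude concrete, the following back-of-the-envelope sketch contrasts the memory footprint of an exact Gram matrix with that of a random-feature matrix. The sizes `n`, `d` and `D` are illustrative assumptions of ours, not values from the paper:

```python
import numpy as np

n, d, D = 20_000, 50, 200   # instances, input dimension, random features (illustrative)

# Exact kernel methods materialize the n x n Gram matrix: O(n^2 d) to compute,
# O(n^2) to store, and PNU-AUC / SAMULT additionally invert it at O(n^3) cost.
gram_gb = n * n * 8 / 1e9

# A random-feature method only materializes an n x D feature matrix: O(nDd) to compute.
rff_mb = n * D * 8 / 1e6

print(f"Gram matrix: {gram_gb:.1f} GB vs. random features: {rff_mb:.0f} MB")
```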
To scale up kernel-based algorithms, a large number of methods have been proposed, e.g., asynchronous parallel algorithms [?; ?; ?] and kernel approximation [?; ?]. To our knowledge, doubly stochastic gradient (DSG) [?] is the most effective method to scale up kernel-based algorithms. Specifically, DSG samples a random instance and the random features of that instance to compute a doubly stochastic gradient, which is then used to update the model. However, different from the standard DSG setting, semi-supervised learning has three sources of data, i.e., positive instances, negative instances and unlabeled instances. In addition, optimizing AUC is a pairwise learning problem, which is more complicated than the pointwise learning problem considered in the standard DSG algorithm. Therefore, the existing algorithms and theoretical analysis for DSG cannot be directly applied to nonlinear semi-supervised AUC maximization.
To address this challenging problem, we introduce multiple sources of randomness. Specifically, we randomly sample a positive instance, a negative instance and an unlabeled instance in each iteration to compose a triplet of data points. Then we use the random features w.r.t. these data triplets to compute the stochastic gradient. Since the stochastic gradient then contains four sources of randomness, we denote our algorithm as quadruply stochastic semi-supervised AUC maximization (QSG-S2AUC). Theoretically, we prove that QSG-S2AUC converges to the optimal solution at the rate of $O(1/t)$, where $t$ is the number of gradient iterations. Extensive experimental results on a variety of benchmark datasets show that QSG-S2AUC is far more efficient than the existing state-of-the-art algorithms for semi-supervised AUC maximization, while retaining similar generalization performance.
Contributions. The main contributions of this paper are summarized as follows.

We propose an efficient nonlinear semi-supervised AUC optimization algorithm based on the DSG framework. Since semi-supervised learning contains three sources of data, we employ triplets of data points in each iteration and extend the standard DSG framework accordingly.

We prove that QSG-S2AUC has a convergence rate of $O(1/t)$, which is the same as that of standard SGD, even though QSG-S2AUC has four sources of randomness.
2 Related Work
In this section, we give a brief review of kernel approximation and large-scale AUC maximization methods, respectively.
2.1 Kernel Approximation
Kernel approximation has attracted a great amount of attention as a way to scale up kernel-based learning algorithms. Data-dependent methods, such as greedy basis selection techniques [?], incomplete Cholesky decomposition [?] and the Nyström method [?], utilize the given training set to compute a low-rank approximation of the kernel matrix. However, they need a large number of training instances to achieve good generalization. To handle this challenge, random Fourier features (RFF) [?] directly approximate the kernel function unbiasedly with some basis functions. However, large amounts of memory are required, since the number of random features needs to be larger than the number of original features to achieve a low approximation error. To further improve RFF, Dai et al. [?] proposed the DSG algorithm. It uses pseudo-random number generators to calculate the random features on the fly, which greatly reduces the memory requirement. These methods have been widely applied to scale up kernel-based learning algorithms, such as [?; ?].
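The memory saving in DSG comes from a simple property of pseudo-random number generators: a stored seed regenerates the same random draws exactly, so the random features themselves never need to be kept. A minimal illustration of this seed-replay trick (our own example, not from the paper):

```python
import numpy as np

# Replaying the seed reproduces the random directions exactly, so DSG stores
# only seeds and coefficients instead of the sampled random features.
w1 = np.random.RandomState(seed=7).randn(4)
w2 = np.random.RandomState(seed=7).randn(4)
assert np.allclose(w1, w2)
```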
2.2 Large Scale AUC Optimization
Recently, several efforts have been devoted to scaling up AUC optimization. For example, Ying et al. [?] formulated AUC optimization as a convex-concave saddle point problem and proposed a stochastic online method (SOLAM) whose per-iteration time and space complexities are those of one datum. FSAUC [?] developed a multi-stage scheme for running a primal-dual stochastic gradient method with adaptively changing parameters, and has a convergence rate of $O(1/n)$, where $n$ is the number of random samples. However, both SOLAM and FSAUC focus on scaling up linear AUC optimization and are incapable of maximizing AUC in the nonlinear setting. Recently, FOAM and NOAM [?] used RFF and the Nyström method, respectively, to scale up kernel-based AUC optimization. However, as mentioned above, both methods require large amounts of memory to achieve good generalization performance, and they are not trivial to extend to nonlinear semi-supervised AUC optimization.
3 Preliminaries
3.1 Supervised AUC Optimization
In supervised learning, let $x \in \mathbb{R}^d$ be a $d$-dimensional pattern and $y \in \{+1, -1\}$ be a class label, and let $p(x, y)$ be the underlying joint density of $(x, y)$. AUC optimization trains a classifier $f$ that maximizes the following function:
$$\mathrm{AUC}(f) = \mathbb{E}_{x^P \sim p_P}\,\mathbb{E}_{x^N \sim p_N}\big[\mathbb{I}\big(f(x^P) > f(x^N)\big)\big],$$
where $p_P(x) = p(x \mid y = +1)$, $p_N(x) = p(x \mid y = -1)$ and $\mathbb{I}(\cdot)$ denotes the indicator function. Obviously, maximizing AUC is equivalent to minimizing the following PN AUC risk:
$$R_{\mathrm{PN}}(f) = \mathbb{E}_{x^P \sim p_P}\,\mathbb{E}_{x^N \sim p_N}\big[\ell\big(f(x^P) - f(x^N)\big)\big], \qquad (1)$$
where $\ell(\cdot)$ is a surrogate of the 0-1 loss. Given the positive and negative datasets $\mathcal{X}_P = \{x_i^P\}_{i=1}^{n_P}$ and $\mathcal{X}_N = \{x_j^N\}_{j=1}^{n_N}$, respectively, the PN AUC risk can be rewritten as follows:
$$\hat{R}_{\mathrm{PN}}(f) = \hat{\mathbb{E}}_P\,\hat{\mathbb{E}}_N\big[\ell\big(f(x^P) - f(x^N)\big)\big] = \frac{1}{n_P n_N}\sum_{i=1}^{n_P}\sum_{j=1}^{n_N}\ell\big(f(x_i^P) - f(x_j^N)\big), \qquad (2)$$
where $\hat{\mathbb{E}}_P$ and $\hat{\mathbb{E}}_N$ denote the empirical means over $\mathcal{X}_P$ and $\mathcal{X}_N$, respectively.
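As a concrete instance, here is a vectorized sketch of the empirical risk (2), assuming the squared pairwise loss $\ell(z) = (1 - z)^2$ that is also used in the experiments of Section 6; the function name and toy inputs are ours:

```python
import numpy as np

def pn_auc_risk(f_pos, f_neg):
    """Empirical PN AUC risk (2) with the squared pairwise loss l(z) = (1 - z)^2.
    f_pos, f_neg: classifier scores on the positive / negative datasets."""
    diff = f_pos[:, None] - f_neg[None, :]    # all n_P * n_N pairwise score gaps
    return np.mean((1.0 - diff) ** 2)

# Well-separated scores (positives ranked above negatives by margin ~1) give low risk.
print(pn_auc_risk(np.array([1.0, 1.2]), np.array([0.0, 0.1])))
```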
3.2 Semi-Supervised AUC Optimization
Since large amounts of instances remain unlabeled in semi-supervised learning, we assume that the labeled dataset is limited, while the unlabeled data can be infinite with the underlying marginal density $p_U(x) = \theta_P p_P(x) + \theta_N p_N(x)$, where $\theta_N = 1 - \theta_P$ and $\theta_P = p(y = +1)$ denotes the positive class prior. Recently, Xie and Li [?] have shown that neither distributional assumptions nor the class prior need to be estimated to achieve an unbiased solution for semi-supervised AUC optimization. Specifically, the PU AUC risk and the NU AUC risk are equivalent to the supervised PN AUC risk up to a linear transformation, where the PU AUC risk is estimated from positive data and unlabeled data treated as negative, and the NU AUC risk is estimated from negative data and unlabeled data treated as positive. We define $R_{\mathrm{PU}}(f)$ and $R_{\mathrm{NU}}(f)$ as follows:
$$R_{\mathrm{PU}}(f) = \mathbb{E}_{x^P \sim p_P}\,\mathbb{E}_{x^U \sim p_U}\big[\ell\big(f(x^P) - f(x^U)\big)\big], \qquad (3)$$
$$R_{\mathrm{NU}}(f) = \mathbb{E}_{x^U \sim p_U}\,\mathbb{E}_{x^N \sim p_N}\big[\ell\big(f(x^U) - f(x^N)\big)\big], \qquad (4)$$
where $\mathbb{E}_{x^U \sim p_U}$ denotes the expectation over the density $p_U$. The PU AUC risk can be written as follows:
$$R_{\mathrm{PU}}(f) = \theta_P\,\mathbb{E}_{x^P}\mathbb{E}_{x_P^U}\big[\ell\big(f(x^P) - f(x_P^U)\big)\big] + \theta_N\,\mathbb{E}_{x^P}\mathbb{E}_{x_N^U}\big[\ell\big(f(x^P) - f(x_N^U)\big)\big] = \theta_P R_{\mathrm{PP}}(f) + \theta_N R_{\mathrm{PN}}(f), \qquad (5)$$
where $x_P^U$ and $x_N^U$ denote the positive and negative instances in the unlabeled dataset, and $R_{\mathrm{PP}}$ (resp. $R_{\mathrm{NN}}$ below) denotes the AUC risk over pairs of positive (resp. negative) instances. Similarly, the NU AUC risk can be rewritten as
$$R_{\mathrm{NU}}(f) = \theta_P R_{\mathrm{PN}}(f) + \theta_N R_{\mathrm{NN}}(f). \qquad (6)$$
Then the PN AUC risk can be formulated as follows:
$$R_{\mathrm{PN}}(f) = R_{\mathrm{PU}}(f) + R_{\mathrm{NU}}(f) - \tfrac{1}{2}. \qquad (7)$$
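To see why (7) requires no class prior, add (5) and (6); a sketch for the 0-1 AUC loss, for which $R_{\mathrm{PP}}(f) = R_{\mathrm{NN}}(f) = \tfrac{1}{2}$ by the symmetry $\ell(z) + \ell(-z) = 1$ over identically distributed pairs:
$$R_{\mathrm{PU}}(f) + R_{\mathrm{NU}}(f) = (\theta_P + \theta_N)\,R_{\mathrm{PN}}(f) + \theta_P R_{\mathrm{PP}}(f) + \theta_N R_{\mathrm{NN}}(f) = R_{\mathrm{PN}}(f) + \tfrac{1}{2},$$
since $\theta_P + \theta_N = 1$. The prior $\theta_P$ cancels entirely, which is exactly the observation of Xie and Li [?].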
Thus, semi-supervised AUC optimization can be formulated as follows:
$$\min_{f}\; (1 - \gamma)\hat{R}_{\mathrm{PN}}(f) + \gamma\big(\hat{R}_{\mathrm{PU}}(f) + \hat{R}_{\mathrm{NU}}(f)\big), \qquad (8)$$
where $\gamma \in [0, 1]$ is the trade-off parameter. To reduce the risk of overfitting, we add a regularizer to (8) and obtain the following objective for semi-supervised AUC optimization:
$$\min_{f \in \mathcal{H}}\; J(f) := (1 - \gamma)\hat{R}_{\mathrm{PN}}(f) + \gamma\big(\hat{R}_{\mathrm{PU}}(f) + \hat{R}_{\mathrm{NU}}(f)\big) + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2, \qquad (9)$$
where $\lambda > 0$ is the regularization parameter and $\|\cdot\|_{\mathcal{H}}$ denotes the norm in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$.
3.3 Random Fourier Feature
In this section, we give a brief review of RFF. Assume that we have a continuous, real-valued, symmetric and shift-invariant kernel function $k(x, x') = k(x - x')$. According to Bochner's theorem [?], such a kernel function is positive definite and has a nonnegative Fourier transform, $k(x - x') = \int p(\omega)e^{i\omega^\top(x - x')}d\omega$, where $p(\omega)$ is a density function associated with $k$. The integrand $e^{i\omega^\top(x - x')}$ can be replaced with $\cos(\omega^\top(x - x'))$ [?]. Then we can obtain a real-valued feature map $\sqrt{2}\cos(\omega^\top x + b)$, where $\omega$ is randomly sampled according to the density $p(\omega)$ and $b$ is drawn uniformly from $[0, 2\pi]$. The feature map for $D$ random features of a real-valued kernel is as follows:
$$\phi_\omega(x) = \sqrt{\frac{2}{D}}\big[\cos(\omega_1^\top x + b_1), \ldots, \cos(\omega_D^\top x + b_D)\big]^\top. \qquad (10)$$
Obviously, $\phi_\omega(x)^\top\phi_\omega(x')$ is an unbiased estimate of $k(x, x')$.
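A quick numerical check of this unbiasedness for the Gaussian kernel $k(x, x') = \exp(-\|x - x'\|^2/(2\sigma^2))$, whose spectral density is $\omega \sim \mathcal{N}(0, \sigma^{-2}I)$. This is a minimal sketch; the function name and sizes are our own choices:

```python
import numpy as np

def gaussian_rff(X, D, sigma, seed):
    """Feature map (10) for the Gaussian kernel: phi(x) = sqrt(2/D) cos(W^T x + b),
    with w_i ~ N(0, I / sigma^2) and b_i ~ U[0, 2*pi], regenerated from `seed`."""
    rng = np.random.RandomState(seed)
    W = rng.normal(0.0, 1.0 / sigma, size=(X.shape[1], D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.RandomState(0)
x, y = rng.randn(2, 5)
sigma = 1.5
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
phi = gaussian_rff(np.vstack([x, y]), D=5000, sigma=sigma, seed=42)
print(exact, phi[0] @ phi[1])   # the inner product concentrates around k(x, y)
```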
4 Quadruply Stochastic Semi-Supervised AUC Maximization
4.1 Quadruply Stochastic Gradients
Based on the definition of the RKHS $\mathcal{H}$, we easily obtain $f(x) = \langle f, k(x, \cdot)\rangle_{\mathcal{H}}$, $\nabla_f f(x) = k(x, \cdot)$ and $\nabla_f \|f\|_{\mathcal{H}}^2 = 2f$. Thus, the functional gradient of the objective (9) can be written as:
$$\nabla J(f) = (1 - \gamma)\hat{\mathbb{E}}_P\hat{\mathbb{E}}_N\Big[\ell'\big(f(x^P) - f(x^N)\big)\big(k(x^P, \cdot) - k(x^N, \cdot)\big)\Big] + \gamma\hat{\mathbb{E}}_P\hat{\mathbb{E}}_U\Big[\ell'\big(f(x^P) - f(x^U)\big)\big(k(x^P, \cdot) - k(x^U, \cdot)\big)\Big] + \gamma\hat{\mathbb{E}}_U\hat{\mathbb{E}}_N\Big[\ell'\big(f(x^U) - f(x^N)\big)\big(k(x^U, \cdot) - k(x^N, \cdot)\big)\Big] + \lambda f, \qquad (11)$$
where $\ell'(\cdot)$ denotes the derivative of the pairwise loss $\ell$ w.r.t. its argument; each of the three risk terms in (9) contributes one derivative term for each of its two inputs, yielding six derivative terms in total.
In order to update the classifier in a stochastic manner, we randomly sample a positive data point $x_t^P$ and a negative data point $x_t^N$ from $\mathcal{X}_P$ and $\mathcal{X}_N$, respectively. In addition, we randomly sample an unlabeled data point $x_t^U$ according to the unlabeled data density $p_U$. In each iteration, we use this triplet of data points to compute the stochastic functional gradient of (8) as follows:
$$\xi(\cdot) = (1 - \gamma)\ell'\big(f(x_t^P) - f(x_t^N)\big)\big(k(x_t^P, \cdot) - k(x_t^N, \cdot)\big) + \gamma\ell'\big(f(x_t^P) - f(x_t^U)\big)\big(k(x_t^P, \cdot) - k(x_t^U, \cdot)\big) + \gamma\ell'\big(f(x_t^U) - f(x_t^N)\big)\big(k(x_t^U, \cdot) - k(x_t^N, \cdot)\big). \qquad (12)$$
We can apply the random Fourier feature method to further approximate this stochastic functional gradient as follows:
$$\zeta(\cdot) = (1 - \gamma)\ell'\big(f(x_t^P) - f(x_t^N)\big)\big(\phi_\omega(x_t^P) - \phi_\omega(x_t^N)\big)^\top\phi_\omega(\cdot) + \gamma\ell'\big(f(x_t^P) - f(x_t^U)\big)\big(\phi_\omega(x_t^P) - \phi_\omega(x_t^U)\big)^\top\phi_\omega(\cdot) + \gamma\ell'\big(f(x_t^U) - f(x_t^N)\big)\big(\phi_\omega(x_t^U) - \phi_\omega(x_t^N)\big)^\top\phi_\omega(\cdot). \qquad (13)$$
Obviously, we have $\mathbb{E}_\omega[\zeta(\cdot)] = \xi(\cdot)$. Thus, we can obtain an unbiased estimate of the gradient (11) by using either $\xi(\cdot)$ or $\zeta(\cdot)$:
$$\mathbb{E}_{x_t^P, x_t^N, x_t^U}\big[\xi(\cdot)\big] + \lambda f = \mathbb{E}_{x_t^P, x_t^N, x_t^U}\mathbb{E}_\omega\big[\zeta(\cdot)\big] + \lambda f = \nabla J(f).$$
Because four sources of randomness (i.e., $x_t^P$, $x_t^N$, $x_t^U$ and $\omega$) are involved in $\zeta(\cdot)$, we call this functional gradient a quadruply stochastic functional gradient.
We first give the update rule with the stochastic gradient $\xi(\cdot)$:
$$h_{t+1}(\cdot) = h_t(\cdot) - \eta_t\big(\xi_t(\cdot) + \lambda h_t(\cdot)\big) = \sum_{i=1}^{t} a_i^t\,\xi_i(\cdot),$$
where $a_i^t = -\eta_i\prod_{j=i+1}^{t}(1 - \eta_j\lambda)$, $\eta_t$ denotes the step size and $h_{t+1}$ denotes the function obtained if we use the gradient $\xi_t(\cdot)$. Since $\zeta_t(\cdot)$ is an unbiased estimate of $\xi_t(\cdot)$, the update rule using $\zeta_t(\cdot)$ after $t$ iterations can be written as
$$f_{t+1}(\cdot) = \sum_{i=1}^{t} a_i^t\,\zeta_i(\cdot),$$
where $f_{t+1}(x)$ denotes the function value at the input $x$ if we use the functional gradient $\zeta_t(\cdot)$.
In order to implement the update process in a computer program, we rewrite the update rule as the following iterative rules with constantly-changing coefficients $\alpha_i$:
$$f_{t+1}(\cdot) = \sum_{i=1}^{t} \big(\alpha_i^{t+1}\big)^\top\phi_{\omega_i}(\cdot), \qquad (14)$$
$$\alpha_t^{t+1} = -\eta_t\Big[(1 - \gamma)\ell'\big(f(x_t^P) - f(x_t^N)\big)\big(\phi_{\omega_t}(x_t^P) - \phi_{\omega_t}(x_t^N)\big) + \gamma\ell'\big(f(x_t^P) - f(x_t^U)\big)\big(\phi_{\omega_t}(x_t^P) - \phi_{\omega_t}(x_t^U)\big) + \gamma\ell'\big(f(x_t^U) - f(x_t^N)\big)\big(\phi_{\omega_t}(x_t^U) - \phi_{\omega_t}(x_t^N)\big)\Big], \qquad (15)$$
$$\alpha_i^{t+1} = (1 - \eta_t\lambda)\,\alpha_i^{t}, \quad i = 1, \ldots, t - 1. \qquad (16)$$
4.2 QSG-S2AUC Algorithms
In our implementation, we use pseudo-random number generators with seed $i$ to sample the random features $\omega_i$. In each iteration, we only need to keep the seed aligned between prediction and training; the prediction function can then be restored easily. Besides, QSG-S2AUC maintains a sequence of coefficients $\{\alpha_i\}$ at each iteration, which has a low memory requirement. Specifically, each iteration of the training algorithm executes the following steps.

Select Random Data Triplets: Randomly sample a positive instance, a negative instance and an unlabeled instance to compose a data triplet. In practice, we use a mini-batch of such data points for better efficiency.

Approximate the Kernel Function: Sample $\omega_i$ with random seed $i$ to calculate the random features on the fly. We keep this seed aligned between prediction and training so that $f(x)$ can be recomputed quickly.
We summarize the training and prediction procedures in Algorithms 1 and 2, respectively.
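The following is a minimal, illustrative sketch of Algorithms 1 and 2 under the update rules (14)-(16). It assumes a Gaussian kernel (reusing `gaussian_rff` from the sketch in Section 3.3), the squared pairwise loss $\ell(z) = (1 - z)^2$ (so $\ell'(z) = -2(1 - z)$), batch size 1, and the step size $\eta_t = \theta/t$ from Section 5; all names and defaults are ours, not the paper's:

```python
import numpy as np

def predict(X, alphas, D, sigma):
    """Algorithm 2 (sketch): replay seeds 1..t to rebuild each phi_{w_i} and
    evaluate f(x) = sum_i alpha_i^T phi_{w_i}(x) as in (14)."""
    f = np.zeros(len(X))
    for i, a in enumerate(alphas, start=1):
        f += gaussian_rff(X, D, sigma, seed=i) @ a
    return f

def train(XP, XN, XU, T, D, sigma, lam, theta, gamma):
    """Algorithm 1 (sketch): one (positive, negative, unlabeled) triplet plus
    one batch of random features per iteration."""
    dloss = lambda z: -2.0 * (1.0 - z)          # l'(z) for l(z) = (1 - z)^2
    alphas, rng = [], np.random.RandomState(0)
    for t in range(1, T + 1):
        eta = theta / t                          # step size eta_t = theta / t
        batch = np.vstack([XP[rng.randint(len(XP))],
                           XN[rng.randint(len(XN))],
                           XU[rng.randint(len(XU))]])
        fp, fn, fu = predict(batch, alphas, D, sigma)
        pp, pn, pu = gaussian_rff(batch, D, sigma, seed=t)
        g = ((1 - gamma) * dloss(fp - fn) * (pp - pn)     # quadruply stochastic
             + gamma * dloss(fp - fu) * (pp - pu)         # gradient coefficient,
             + gamma * dloss(fu - fn) * (pu - pn))        # cf. (13) and (15)
        alphas = [(1 - eta * lam) * a for a in alphas]    # rule (16)
        alphas.append(-eta * g)                           # rule (15), seed = t
    return alphas
```

Only the coefficient vectors and the iteration index (the seed) are stored; the features $\phi_{\omega_i}$ are regenerated on demand, which is what keeps the memory footprint low.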
5 Convergence Analysis
In this section, we prove that QSG-S2AUC converges to the optimal solution at the rate of $O(1/t)$. We first give several assumptions, which are standard in the DSG literature [?].
Assumption 1
(Bound of kernel function). The kernel function is bounded, i.e., $k(x, x') \le \kappa$ for some $\kappa > 0$.
Assumption 2
(Bound of random feature norm). The random features are bounded, i.e., $|\phi_\omega(x)^\top\phi_\omega(x')| \le \phi$ for some $\phi > 0$.
Assumption 3
(Lipschitz continuity). The first-order derivative $\ell'$ of each of the three pairwise loss terms in (11) (the PN, PU and NU terms) is Lipschitz continuous w.r.t. the first input of its pair and Lipschitz continuous w.r.t. the second.
Assumption 4
(Bound of derivatives). There exist constants $M_1, \ldots, M_6 > 0$ bounding, respectively, the six derivative terms in (11), i.e., the derivatives of the PN, PU and NU losses w.r.t. each of their two inputs. We write $M$ for an upper bound assembled from $M_1, \ldots, M_6$ and $\gamma$.
We use the framework of [?] to prove that $f_{t+1}$ converges to the optimal solution $f^*$ of (9). Specifically, we use the aforementioned $h_{t+1}$ as an intermediate value to decompose the difference between $f_{t+1}$ and $f^*$ as follows:
$$\mathbb{E}\big[|f_{t+1}(x) - f^*(x)|^2\big] \le 2\,\mathbb{E}\big[|f_{t+1}(x) - h_{t+1}(x)|^2\big] + 2\kappa\,\mathbb{E}\big[\|h_{t+1} - f^*\|_{\mathcal{H}}^2\big]. \qquad (17)$$
In other words, the total approximation error comprises the error caused by approximating the kernel with random features and the error caused by sampling random data. The bound on the original error is then obtained by summing the bounds on these two parts.
We first bound the error due to random features and the error due to random data in Lemmas 1 and 3, respectively. All detailed proofs are provided in the Appendix.
Lemma 1 (Error due to random features)
Let $\mathcal{X}$ denote the whole training set in the semi-supervised learning problem. For any $x \in \mathcal{X}$, we have
$$\mathbb{E}\big[|f_{t+1}(x) - h_{t+1}(x)|^2\big] \le B_{1, t+1} := 4M^2(\kappa + \phi)^2\sum_{i=1}^{t}|a_i^t|^2, \qquad (18)$$
where the expectation is over the data triplets and the random features, $M$ is the derivative bound from Assumption 4, and $a_i^t = -\eta_i\prod_{j=i+1}^{t}(1 - \eta_j\lambda)$.
Obviously, this upper bound depends on the convergence of $\sum_{i=1}^{t}|a_i^t|^2$, which is characterized in Lemma 2.
Lemma 2
Suppose $\eta_i = \frac{\theta}{i}$ ($i = 1, \ldots, t$) with $\theta\lambda \in (1, 2)$. Then we have $|a_i^t| \le \frac{\theta}{t}$ and $\sum_{i=1}^{t}|a_i^t|^2 \le \frac{\theta^2}{t}$.
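A quick numerical sanity check of Lemma 2 (our own illustration; $\theta$, $\lambda$ and $T$ are arbitrary values satisfying $\theta\lambda \in (1, 2)$):

```python
import numpy as np

def coeff_stats(T, theta, lam):
    """|a_i^t| = eta_i * prod_{j=i+1}^t (1 - eta_j * lam), with eta_i = theta / i."""
    eta = theta / np.arange(1, T + 1)
    a = np.array([eta[i] * np.prod(1.0 - lam * eta[i + 1:]) for i in range(T)])
    return np.abs(a).max(), (a ** 2).sum()

theta, lam, T = 1.5, 1.0, 1000              # theta * lam = 1.5, inside (1, 2)
amax, ssq = coeff_stats(T, theta, lam)
print(amax <= theta / T + 1e-12, ssq <= theta ** 2 / T)   # both True
```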
Table 2: Summary of the benchmark datasets.

| Dataset | Features | Samples | Source |
|---|---|---|---|
| Codrna | 8 | 59,535 | LIBSVM |
| Ijcnn1 | 22 | 49,990 | LIBSVM |
| Susy | 18 | 5,000,000 | LIBSVM |
| Covtype | 54 | 581,012 | LIBSVM |
| Higgs | 28 | 1,100,000 | LIBSVM |
| Skin | 3 | 245,057 | LIBSVM |
| Dota2 | 116 | 92,650 | UCI |
| Unclonable | 129 | 6,000,000 | UCI |
Lemma 3 (Error due to random data)
Set $\eta_i = \frac{\theta}{i}$ with $\theta\lambda \in (1, 2)$. Then we have
$$\mathbb{E}\big[\|h_{t+1} - f^*\|_{\mathcal{H}}^2\big] \le \frac{Q_1}{t}, \qquad (19)$$
where $Q_1$ is a constant independent of $t$ that depends on $\theta$, $\lambda$, $\kappa$ and the derivative bounds in Assumption 4.
Theorem 1 (Convergence in expectation)
Let $\mathcal{X}$ denote the whole training set in the semi-supervised learning problem. Set $\eta_i = \frac{\theta}{i}$ with $\theta\lambda \in (1, 2)$. Then, for any $x \in \mathcal{X}$, we have
$$\mathbb{E}\big[|f_{t+1}(x) - f^*(x)|^2\big] \le \frac{C}{t},$$
where $C := 8M^2(\kappa + \phi)^2\theta^2 + 2\kappa Q_1$.
Remark 2
Theorem 1 shows that, for any given $x$, the value of $f_{t+1}$ at $x$ converges to that of $f^*$ in terms of the squared Euclidean distance at the rate of $O(1/t)$. This rate is the same as that of standard DSG, even though our problem is much more complicated and has four sources of randomness.
6 Experiments
In this section, we present experimental results on several datasets to demonstrate the effectiveness and efficiency of QSG-S2AUC.
6.1 Experimental Setup
We compare the AUC results and running time of QSG-S2AUC with the state-of-the-art semi-supervised AUC maximization algorithms summarized as follows.

PNU-AUC: The unbiased semi-supervised AUC optimization method proposed in [?], based on positive-unlabeled learning.

SAMULT: The method proposed in [?], which does not require the class prior distribution to achieve an unbiased solution.
All the experiments were run on a PC with 56 2.2GHz cores and 80GB RAM. We implemented QSG-S2AUC and SAMULT in MATLAB. We used the MATLAB code from https://github.com/t-sakai-kure/PNU as the implementation of PNU-AUC. For all algorithms, we use the squared pairwise loss $\ell(z) = (1 - z)^2$ and the Gaussian kernel $k(x, x') = \exp(-\|x - x'\|^2/(2\sigma^2))$. The hyperparameters ($\lambda$, $\sigma$ and $\gamma$) are chosen via 5-fold cross-validation; $\lambda$ and $\sigma$ were searched over a grid of candidate values. The trade-off parameter $\gamma$ in SAMULT and QSG-S2AUC was searched over $[0, 1]$, and the corresponding trade-off parameter in PNU-AUC was searched over $[-1, 1]$. In addition, the class prior in PNU-AUC is set to the class proportion in the whole training set, which can be estimated by [?]. All results are averaged over 10 trials.
6.2 Datasets
We carry out the experiments on eight large-scale benchmark datasets collected from the LIBSVM repository (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/) and the UCI repository (http://archive.ics.uci.edu/ml/datasets.html). The size and feature dimensionality of each dataset are summarized in Table 2. To conduct the experiments in a semi-supervised setting, we randomly sample 200 labeled instances and treat the rest of the data as unlabeled. All the data features are normalized in advance.
6.3 Results and Discussion
Figure 1 shows the training time of the three algorithms against different sizes of unlabeled samples on the eight benchmark datasets, where the number of labeled samples is fixed at 200. We find that QSG-S2AUC is always faster than SAMULT and PNU-AUC. This is because SAMULT and PNU-AUC need $O(n^3)$ operations to compute the inverse of the kernel matrix. In contrast, QSG-S2AUC uses RFF to approximate the kernel function, and in each iteration it only needs to regenerate the random features from the seed $i$. In addition, the low memory requirement of QSG-S2AUC allows efficient training on large-scale datasets on which PNU-AUC and SAMULT run out of memory. Figure 2 presents the testing AUC of these algorithms on the eight benchmark datasets. The results show that QSG-S2AUC attains similar AUC to the other methods on most datasets, and the highest AUC on Covtype and Ijcnn1. Based on these results, we conclude that QSG-S2AUC is superior to the other state-of-the-art algorithms in terms of efficiency and scalability, while retaining similar generalization performance.
7 Conclusion
In this paper, we propose a novel scalable semi-supervised AUC optimization algorithm, QSG-S2AUC. Considering that semi-supervised learning contains three data sources, QSG-S2AUC randomly samples one instance from each data source in each iteration. Then, their random features are generated and used to calculate a quadruply stochastic functional gradient for the model update. Even though this optimization process contains multiple layers of stochastic sampling, we prove that QSG-S2AUC has a convergence rate of $O(1/t)$. The experimental results on various datasets also demonstrate the superiority of the proposed QSG-S2AUC.
Acknowledgments
H.H. was partially supported by U.S. NSF IIS 1836945, IIS 1836938, DBI 1836866, IIS 1845666, IIS 1852606, IIS 1838627, IIS 1837956. B.G. was partially supported by the National Natural Science Foundation of China (No: 61573191), and the Natural Science Foundation (No. BK20161534), Six talent peaks project (No. XYDXX042) in Jiangsu Province.
References
 [Amini et al., 2008] Massih-Reza Amini, Tuong-Vinh Truong, and Cyril Goutte. A boosting algorithm for learning bipartite ranking functions with partially labeled data. 2008.
 [Dai et al., 2014] Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems, pages 3041–3049, 2014.
 [Ding et al., 2017] Yi Ding, Chenghao Liu, Peilin Zhao, and Steven CH Hoi. Large scale kernel methods for online AUC maximization. In 2017 IEEE International Conference on Data Mining (ICDM), pages 91–100. IEEE, 2017.
 [Drineas and Mahoney, 2005] Petros Drineas and Michael W Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6(Dec):2153–2175, 2005.
 [du Plessis et al., 2015] Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Class-prior estimation for learning from positive and unlabeled data. In ACML, pages 221–236, 2015.
 [Fine and Scheinberg, 2001] Shai Fine and Katya Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2(Dec):243–264, 2001.
 [Fujino and Ueda, 2016] Akinori Fujino and Naonori Ueda. A semi-supervised AUC optimization method with generative models. In Data Mining (ICDM), 2016 IEEE 16th International Conference on, pages 883–888. IEEE, 2016.
 [Gao and Zhou, 2015] Wei Gao and Zhi-Hua Zhou. On the consistency of AUC pairwise optimization. In IJCAI, pages 939–945, 2015.
 [Gao et al., 2013] Wei Gao, Rong Jin, Shenghuo Zhu, and Zhi-Hua Zhou. One-pass AUC optimization. In International Conference on Machine Learning, pages 906–914, 2013.
 [Geng et al., 2019] Xiang Geng, Bin Gu, Xiang Li, Wanli Shi, Guansheng Zheng, and Heng Huang. Scalable semi-supervised SVM via triply stochastic gradients. In 28th International Joint Conference on Artificial Intelligence, 2019.
 [Gu and Huo, 2018] Bin Gu and Zhouyuan Huo. Asynchronous doubly stochastic group regularized learning. In International Conference on Artificial Intelligence and Statistics (AISTATS 2018), 2018.
 [Gu et al., 2014] Bin Gu, Victor S Sheng, Keng Yeow Tay, Walter Romano, and Shuo Li. Incremental support vector learning for ordinal regression. IEEE Transactions on Neural Networks and Learning Systems, 26(7):1403–1416, 2014.
 [Gu et al., 2016] Bin Gu, Zhouyuan Huo, and Heng Huang. Asynchronous stochastic block coordinate descent with variance reduction. arXiv preprint arXiv:1610.09447, 2016.
 [Gu et al., 2018a] Bin Gu, Yingying Shan, Xiang Geng, and Guansheng Zheng. Accelerated asynchronous greedy coordinate descent algorithm for SVMs. In IJCAI, pages 2170–2176, 2018.
 [Gu et al., 2018b] Bin Gu, Miao Xin, Zhouyuan Huo, and Heng Huang. Asynchronous doubly stochastic sparse kernel learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 [Gu et al., 2018c] Bin Gu, Xiao-Tong Yuan, Songcan Chen, and Heng Huang. New incremental learning algorithm for semi-supervised support vector machine. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1475–1484. ACM, 2018.
 [Hanley and McNeil, 1982] James A Hanley and Barbara J McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29–36, 1982.
 [Huang et al., 2019] Feihu Huang, Bin Gu, Zhouyuan Huo, Songcan Chen, and Heng Huang. Faster gradient-free proximal stochastic methods for nonconvex nonsmooth optimization. arXiv preprint arXiv:1902.06158, 2019.
 [Li et al., 2017] Xiang Li, Bin Gu, Shuang Ao, Huaimin Wang, and Charles X Ling. Triply stochastic gradients on multiple kernel learning. UAI, 2017.
 [Liang, 2005] Percy Liang. Semi-supervised learning for natural language. PhD thesis, Massachusetts Institute of Technology, 2005.
 [Liu et al., 2018] Mingrui Liu, Xiaoxuan Zhang, Zaiyi Chen, Xiaoyu Wang, and Tianbao Yang. Fast stochastic AUC maximization with O(1/n)-convergence rate. In International Conference on Machine Learning, pages 3195–3203, 2018.
 [Rahimi and Recht, 2008] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.
 [Rudin, 2017] Walter Rudin. Fourier Analysis on Groups. Courier Dover Publications, 2017.
 [Sakai et al., 2017] Tomoya Sakai, Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Semi-supervised classification based on classification from positive and unlabeled data. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2998–3006. JMLR.org, 2017.
 [Sakai et al., 2018] Tomoya Sakai, Gang Niu, and Masashi Sugiyama. Semi-supervised AUC optimization based on positive-unlabeled learning. Machine Learning, 107(4):767–794, 2018.
 [Sholokhov et al., 2018] Alexey Sholokhov, Md Sahidullah, and Tomi Kinnunen. Semi-supervised speech activity detection with an application to automatic speaker verification. Computer Speech & Language, 47:132–156, 2018.
 [Smola and Schölkopf, 2000] Alex J Smola and Bernhard Schölkopf. Sparse greedy matrix approximation for machine learning. 2000.
 [Wang et al., 2010] Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Semi-supervised hashing for scalable image retrieval. 2010.
 [Xie and Li, 2018] Zheng Xie and Ming Li. Semi-supervised AUC optimization without guessing labels of unlabeled data. 2018.
 [Ying et al., 2016] Yiming Ying, Longyin Wen, and Siwei Lyu. Stochastic online AUC maximization. In Advances in Neural Information Processing Systems, pages 451–459, 2016.
 [Yu et al., 2019] Shuyang Yu, Bin Gu, Kunpeng Ning, Haiyan Chen, Jian Pei, and Heng Huang. Tackle balancing constraint for incremental semi-supervised support vector learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019.
Appendix A Detailed Proof of Convergence Rate
A.1 Proof of Lemma 1
Here we give the detailed proof of Lemma 1.
Proof 1
We denote $Z_i(x) := \zeta_i(x) - \xi_i(x)$. According to the assumptions in Section 5, $Z_i(x)$ has a bound: Assumptions 1, 2 and 4 give $|\zeta_i(x)| \le M\phi$ and $|\xi_i(x)| \le M\kappa$, hence $|Z_i(x)| \le M(\kappa + \phi)$. Writing $f_{t+1}(x) - h_{t+1}(x) = \sum_{i=1}^{t} a_i^t Z_i(x)$ and bounding the expectation of its square term by term, we obtain
$$\mathbb{E}\big[|f_{t+1}(x) - h_{t+1}(x)|^2\big] \le 4M^2(\kappa + \phi)^2\sum_{i=1}^{t}|a_i^t|^2 = B_{1, t+1}.$$
Then we obtain Lemma 1. This completes the proof.
A.2 Proof of Lemma 2
Here we give detailed proof of Lemma 2.
Proof 2
Obviously, $|a_i^t| = \eta_i\prod_{j=i+1}^{t}(1 - \eta_j\lambda)$ with $\eta_i = \theta/i$. Then we have
$$\frac{a_i^t}{a_{i+1}^t} = \frac{\eta_i}{\eta_{i+1}}\big(1 - \eta_{i+1}\lambda\big) = \frac{i + 1}{i}\Big(1 - \frac{\theta\lambda}{i + 1}\Big) = \frac{i + 1 - \theta\lambda}{i}.$$
When $\theta\lambda \in (1, 2)$, each factor $1 - \theta\lambda/j$ is positive for $j \ge 2$, and the ratio above is smaller than $1$, so $|a_i^t|$ is increasing in $i$ and is maximized at $i = t$, where $|a_t^t| = \eta_t = \theta/t$. Hence $|a_i^t| \le \theta/t$ and $\sum_{i=1}^{t}|a_i^t|^2 \le t\cdot\theta^2/t^2 = \theta^2/t$. Therefore, we obtain Lemma 2. This completes the proof.
A.3 Proof of Lemma 3
For convenience, we abbreviate the loss derivatives evaluated on the $t$-th data triplet, $\ell'\big(f(x_t^P) - f(x_t^N)\big)$, $\ell'\big(f(x_t^P) - f(x_t^U)\big)$ and $\ell'\big(f(x_t^U) - f(x_t^N)\big)$, together with their counterparts evaluated at $h_t$. In addition, we define the following three gradient terms: the stochastic gradient actually used, $g_t(\cdot) := \xi_t(\cdot) + \lambda h_t(\cdot)$ with the loss derivatives evaluated at $f_t$; $\hat{g}_t(\cdot)$, the same expression with the derivatives evaluated at $h_t$; and the full gradient $\bar{g}_t(\cdot) := \mathbb{E}[\hat{g}_t(\cdot)] = \nabla J(h_t)$.
Note that from our previous definitions, we have $h_{t+1} = h_t - \eta_t g_t$.
Denote $A_{t+1} := \|h_{t+1} - f^*\|_{\mathcal{H}}^2$. Then we have