Quadruply Stochastic Gradients for Large Scale Nonlinear
Semi-Supervised AUC Optimization
Semi-supervised learning is pervasive in real-world applications, where only a few labeled data are available and large amounts of instances remain unlabeled. Since AUC is an important model evaluation metric in classification, directly optimizing AUC in the semi-supervised learning scenario has drawn much attention in the machine learning community. Recently, it has been shown that one can find an unbiased solution for the semi-supervised AUC maximization problem without knowing the class prior distribution. However, this method is hardly scalable for nonlinear classification problems with kernels. To address this problem, we propose a novel scalable quadruply stochastic gradient algorithm (QSG-S2AUC) for nonlinear semi-supervised AUC optimization. In each iteration of the stochastic optimization process, our method randomly samples a positive instance, a negative instance, an unlabeled instance and their random features to compute the gradient, and then updates the model with this quadruply stochastic gradient to approach the optimal solution. More importantly, we prove that QSG-S2AUC converges to the optimal solution at the rate of $O(1/t)$, where $t$ is the iteration number. Extensive experimental results on a variety of benchmark datasets show that QSG-S2AUC is far more efficient than the existing state-of-the-art algorithms for semi-supervised AUC maximization, while retaining similar generalization performance.
Wanli Shi, Bin Gu, Xiang Li, Xiang Geng and Heng Huang
School of Computer & Software, Nanjing University of Information Science & Technology, P.R.China
JD Finance America Corporation
Computer Science Department, University of Western Ontario, Canada
Department of Electrical & Computer Engineering, University of Pittsburgh, USA
1 Introduction
Semi-supervised learning addresses problems where the available data consist of a small number of labeled samples and a large number of unlabeled samples. It is of immense practical interest in a wide range of applications, such as image retrieval [Wang et al., 2010], natural language processing [Liang, 2005] and speech analysis [Sholokhov et al., 2018]. Since semi-supervised learning requires less human effort and can achieve better generalization performance, it has attracted a great deal of attention in the machine learning community, e.g., [?; ?; ?; ?; ?].
The area under the ROC curve (AUC) [Hanley and McNeil, 1982] measures the probability that a randomly drawn positive instance is ranked higher than a randomly drawn negative instance. Thus, AUC is a more effective performance measure than accuracy in imbalanced binary classification problems. Many studies [?; ?] have also pointed out that optimizing AUC can achieve better generalization performance than directly optimizing accuracy. Owing to these advantages, considerable effort has been devoted to introducing AUC into semi-supervised learning.
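The pairwise interpretation of AUC can be checked numerically. The following is a minimal sketch, assuming NumPy; the helper name `empirical_auc` is hypothetical, and counting ties as one half is a common convention rather than something stated in the text.

```python
import numpy as np

def empirical_auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    sp = np.asarray(scores_pos, dtype=float)[:, None]  # shape (n_p, 1)
    sn = np.asarray(scores_neg, dtype=float)[None, :]  # shape (1, n_n)
    correct = (sp > sn).astype(float) + 0.5 * (sp == sn)
    return float(correct.mean())

# Three positive scores against two negative scores: 5 of the 6 pairs are
# ordered correctly (only 0.3 vs. 0.5 is not).
auc_example = empirical_auc([0.9, 0.8, 0.3], [0.5, 0.1])
```

Because every positive-negative pair contributes equally, the measure does not depend on how many instances each class has, which is why it suits imbalanced problems.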
Recently, several algorithms have been proposed to address the semi-supervised AUC optimization problem. For instance, to train a classifier, SSRankBoost [Amini et al., 2008] and OptAG [Fujino and Ueda, 2016] exploit the assumption that two samples share the same label if their distance in a metric space is small. However, this restrictive assumption may not always hold in real-world applications and could lead to biased solutions. Sakai et al. [2018] pointed out that unlabeled and labeled instances follow the same joint probability distribution, so the restrictive assumption is not necessary. However, their method, PNU-AUC, requires an estimate of the class prior, which is difficult to obtain when labeled instances are extremely scarce. Recently, Xie and Li [2018] showed that neither the class prior nor any other distributional assumption about the unlabeled data is necessary to find the unbiased solution. We summarize these algorithms in Table 1.
| Algorithm | Reference | Function model | Computational complexity | Space complexity |
Nonlinear data structures widely exist in many real-world problems, and the kernel method is a typical way to handle them [?; ?]. However, this approach can hardly scale to large datasets. Specifically, the kernel matrix needs $O(n^2 d)$ operations to be calculated and $O(n^2)$ memory to be stored, where $n$ denotes the number of instances and $d$ denotes the dimensionality of the data [?]. These computational bottlenecks become more severe for semi-supervised learning because the sample size is typically very large in the semi-supervised scenario. Even worse, PNU-AUC and SAMULT [Xie and Li, 2018] need $O(n^3)$ operations to compute the matrix inverse. Thus, scaling up nonlinear semi-supervised AUC maximization is a challenging problem.
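To make the storage bottleneck concrete, the following back-of-the-envelope sketch computes the memory needed for a dense kernel matrix. It assumes 8-byte double-precision entries, which is an illustrative assumption, and the function name is hypothetical.

```python
def kernel_matrix_gib(n, bytes_per_entry=8):
    """Memory in GiB needed to store a dense n-by-n kernel matrix."""
    return n * n * bytes_per_entry / 2**30

# Quadratic growth: ten times more instances need a hundred times more memory.
for n in (10_000, 100_000, 1_000_000):
    print(f"n = {n:>9,d}: {kernel_matrix_gib(n):10.1f} GiB")
```

Under this assumption, a dataset of 100,000 instances already needs roughly 75 GiB for the kernel matrix alone, before any matrix inversion is attempted.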
To scale up kernel-based algorithms, many methods have been proposed, e.g., asynchronous parallel algorithms [?; ?; ?] and kernel approximation [?; ?]. To our knowledge, doubly stochastic gradient (DSG) [Dai et al., 2014] is the most effective method to scale up kernel-based algorithms. Specifically, DSG samples a random instance and random features to compute the doubly stochastic gradient, which is used to update the model. However, different from the standard DSG setting, semi-supervised learning has three sources of data, i.e., positive instances, negative instances and unlabeled instances. In addition, optimizing AUC is a pairwise learning problem, which is more complicated than the pointwise learning problem considered in the standard DSG algorithm. Therefore, the existing algorithms and theoretical analysis for DSG cannot be directly applied to nonlinear semi-supervised AUC maximization.
To address this challenging problem, we introduce multiple sources of randomness. Specifically, we randomly sample a positive, a negative and an unlabeled instance in each iteration to compose a triplet of data points. Then we use the random features w.r.t. these data triplets to compute the stochastic gradient. Since the stochastic gradient then contains four sources of randomness, we call our algorithm quadruply stochastic semi-supervised AUC maximization (QSG-S2AUC). Theoretically, we prove that QSG-S2AUC converges to the optimal solution at the rate of $O(1/t)$, where $t$ is the number of gradient iterations. Extensive experimental results on a variety of benchmark datasets show that QSG-S2AUC is far more efficient than the existing state-of-the-art algorithms for semi-supervised AUC maximization, while retaining similar generalization performance.
Contributions. The main contributions of this paper are summarized as follows.
- We propose an efficient nonlinear semi-supervised AUC optimization algorithm based on the DSG framework. Since semi-supervised learning contains three sources of data, we employ triplets of data points in each iteration and extend the standard DSG framework.
- We prove that QSG-S2AUC has a convergence rate of $O(1/t)$, which is the same as that of standard SGD, even though QSG-S2AUC has four sources of randomness.
2 Related Works
In this section, we give a brief review of kernel approximation and large scale AUC maximization methods respectively.
2.1 Kernel Approximation
Kernel approximation has attracted a great deal of attention as a way to scale up kernel-based learning algorithms. Data-dependent methods, such as greedy basis selection techniques [Smola and Schölkopf, 2000], incomplete Cholesky decomposition [Fine and Scheinberg, 2001] and the Nyström method [Drineas and Mahoney, 2005], utilize the given training set to compute a low-rank approximation of the kernel matrix. However, they need a large number of training instances to achieve good generalization. To handle this challenge, random Fourier features (RFF) [Rahimi and Recht, 2008] directly approximate the kernel function unbiasedly with some basis functions. However, large amounts of memory are required, since the number of random features needs to be larger than the number of original features to achieve a low approximation error. To further improve RFF, Dai et al. [2014] proposed the DSG algorithm. It uses pseudo-random number generators to calculate the random features on the fly, which greatly reduces the memory requirement. These methods have been widely applied to scale up kernel-based learning algorithms, such as [?; ?].
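The memory saving from pseudo-random number generation can be illustrated with a short sketch. It assumes NumPy and a Gaussian kernel (so the random frequencies are Gaussian); `features_from_seed` is a hypothetical helper, not a function from any DSG implementation. Only an integer seed is stored, and the frequencies are regenerated whenever the features are needed.

```python
import numpy as np

def features_from_seed(X, seed, n_features, sigma=1.0):
    """Regenerate random Fourier features for the rows of X from an integer seed."""
    rng = np.random.default_rng(seed)
    # Frequencies for a Gaussian kernel with bandwidth sigma (assumed here).
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], n_features))
    proj = X @ W
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(n_features)

X = np.arange(6, dtype=float).reshape(3, 2)
phi_train = features_from_seed(X, seed=7, n_features=4)
phi_again = features_from_seed(X, seed=7, n_features=4)  # nothing was stored
same = bool(np.array_equal(phi_train, phi_again))
```

Storing one integer per iteration instead of a frequency matrix is what keeps the memory footprint independent of the feature dimension.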
2.2 Large Scale AUC Optimization
Recently, several efforts have been devoted to scaling up AUC optimization. For example, Ying et al. [2016] formulated AUC optimization as a convex-concave saddle point problem and proposed a stochastic online method (SOLAM) which has the time and space complexities of one datum. FSAUC [Liu et al., 2018] developed a multi-stage scheme for running a primal-dual stochastic gradient method with adaptively changing parameters. FSAUC has a convergence rate of $O(1/n)$, where $n$ is the number of random samples. However, both SOLAM and FSAUC focus on scaling up linear AUC optimization and are incapable of maximizing AUC in the nonlinear setting. Recently, FOAM and NOAM [Ding et al., 2017] used RFF and the Nyström method, respectively, to scale up kernel-based AUC optimization. However, as mentioned above, both methods require large amounts of memory to achieve good generalization performance, and it is not trivial to extend them to nonlinear semi-supervised AUC optimization.
3.1 Supervised AUC Optimization
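In supervised learning, let $x \in \mathbb{R}^d$ be a $d$-dimensional pattern and $y \in \{+1, -1\}$ be a class label. Let $p(x, y)$ be the underlying joint density of $(x, y)$. AUC optimization trains a classifier $f$ that maximizes the following function.
$$\mathrm{AUC}(f) = 1 - \mathbb{E}_{x^p \sim p_P(x)}\big[\mathbb{E}_{x^n \sim p_N(x)}[\ell_{01}(f(x^p), f(x^n))]\big],$$
where $p_P(x) = p(x \mid y = +1)$, $p_N(x) = p(x \mid y = -1)$ and $\ell_{01}(u, v) = \frac{1}{2}(1 - \mathrm{sign}(u - v))$. Obviously, maximizing AUC is equivalent to minimizing the following PN AUC risk.
$$R_{PN}(f) = \mathbb{E}_{x^p \sim p_P(x)}\big[\mathbb{E}_{x^n \sim p_N(x)}[\ell_{01}(f(x^p), f(x^n))]\big].$$
Given the positive and negative datasets $D_p = \{x_i^p\}_{i=1}^{n_p}$ and $D_n = \{x_j^n\}_{j=1}^{n_n}$, respectively, the PN AUC risk can be rewritten as follows.
$$\widehat{R}_{PN}(f) = \hat{\mathbb{E}}_{x^p \in D_p}\big[\hat{\mathbb{E}}_{x^n \in D_n}[\ell_{01}(f(x^p), f(x^n))]\big],$$
where $\hat{\mathbb{E}}_{x^p \in D_p}$ and $\hat{\mathbb{E}}_{x^n \in D_n}$ denote the means over $D_p$ and $D_n$, respectively.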
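As an illustration, the empirical PN AUC risk can be computed as a mean over all positive-negative pairs. The sketch below assumes NumPy; the function names are hypothetical, the zero-one loss follows the usual convention that a tie costs one half, and the square pairwise surrogate is written in one common form, $(1 - (u - v))^2$, which is an assumption about the exact loss.

```python
import numpy as np

def zero_one_pairwise_loss(u, v):
    """1 if the pair is misordered (u < v), 0 if ordered, 1/2 for ties."""
    return 0.5 * (1.0 - np.sign(u - v))

def square_pairwise_loss(u, v):
    """Assumed square surrogate: penalizes margins u - v that differ from 1."""
    return (1.0 - (u - v)) ** 2

def empirical_pn_risk(f_pos, f_neg, loss):
    """Mean of loss(f(x_p), f(x_n)) over all positive-negative pairs."""
    u = np.asarray(f_pos, dtype=float)[:, None]
    v = np.asarray(f_neg, dtype=float)[None, :]
    return float(loss(u, v).mean())

f_pos, f_neg = [0.9, 0.8, 0.3], [0.5, 0.1]
risk_01 = empirical_pn_risk(f_pos, f_neg, zero_one_pairwise_loss)  # 1 - AUC
risk_sq = empirical_pn_risk(f_pos, f_neg, square_pairwise_loss)
```

With the zero-one loss the risk is exactly one minus the empirical AUC; the square surrogate gives a smooth quantity that gradient methods can minimize.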
3.2 Semi-Supervised AUC Optimization
Since large amounts of instances remain unlabeled in semi-supervised learning, we assume that the labeled dataset is limited while the unlabeled data can be infinite and have the underlying distribution density $p(x) = \theta_P\, p(x \mid y = +1) + \theta_N\, p(x \mid y = -1)$, where $\theta_N = 1 - \theta_P$ and $\theta_P$ denotes the positive class prior. Recently, Xie and Li [2018] have shown that it is unnecessary to rely on distributional assumptions or to estimate the class prior to achieve an unbiased solution for semi-supervised AUC optimization. Specifically, the PU AUC risk and the NU AUC risk are equivalent to the supervised PN AUC risk up to a linear transformation, where the PU AUC risk is estimated from positive data and unlabeled data treated as negative data, and the NU AUC risk is estimated from negative data and unlabeled data treated as positive data. We define the PU AUC risk and the NU AUC risk as follows,
where denotes the expectation over the density . PU AUC risk can be written as follows.
where and denotes the positive and negative instances in unlabeled dataset. Similarly, NU AUC risk can be rewritten as
Then PN AUC risk can be formulated as follows.
Thus, the semi-supervised AUC optimization can be formulated as follows.
where $\gamma$ is the trade-off parameter. To reduce the risk of overfitting, we add an $\ell_2$-regularizer to (8) and obtain the following objective for semi-supervised AUC optimization.
where $\lambda$ is the regularization parameter and $\|\cdot\|_{\mathcal{H}}$ denotes the norm in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$.
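The way the three risks combine can be sketched numerically. The weighting below, $\gamma R_{PN} + (1 - \gamma)(R_{PU} + R_{NU} - \tfrac{1}{2})$ plus a squared-norm penalty, is an assumed form that is consistent with the statement that the PU and NU risks recover the PN risk up to a linear transformation; the offset of one half holds for the zero-one loss. All function and parameter names are hypothetical, and NumPy is assumed.

```python
import numpy as np

def zero_one(u, v):
    return 0.5 * (1.0 - np.sign(u - v))

def pair_risk(f_a, f_b, loss):
    """Mean pairwise loss with f_a in the 'positive' role and f_b in the 'negative' role."""
    u = np.asarray(f_a, dtype=float)[:, None]
    v = np.asarray(f_b, dtype=float)[None, :]
    return float(loss(u, v).mean())

def pnu_objective(f_pos, f_neg, f_unl, loss, gamma, lam=0.0, sq_norm=0.0):
    r_pn = pair_risk(f_pos, f_neg, loss)
    r_pu = pair_risk(f_pos, f_unl, loss)   # unlabeled data treated as negative
    r_nu = pair_risk(f_unl, f_neg, loss)   # unlabeled data treated as positive
    return gamma * r_pn + (1.0 - gamma) * (r_pu + r_nu - 0.5) + 0.5 * lam * sq_norm

# Sanity check: if the unlabeled set is exactly the pooled labeled data,
# the PU + NU combination reproduces the PN risk for the zero-one loss.
f_pos, f_neg = [0.9, 0.8, 0.3], [0.5, 0.1]
f_unl = f_pos + f_neg
only_pn = pnu_objective(f_pos, f_neg, f_unl, zero_one, gamma=1.0)
only_pu_nu = pnu_objective(f_pos, f_neg, f_unl, zero_one, gamma=0.0)
```

In this toy case both settings give the same value, which illustrates why no class prior is needed: the same-class pairs inside the unlabeled data contribute a constant that the offset removes.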
3.3 Random Fourier Feature
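In this section, we give a brief review of RFF. Assume that we have a continuous, real-valued, symmetric and shift-invariant kernel function $k(x, x') = k(x - x')$. According to Bochner's theorem [Rudin, 2017], this kernel function is positive definite and has a nonnegative Fourier transform, i.e., $k(x - x') = \int_{\mathbb{R}^d} p(\omega)\, e^{j \omega^T (x - x')}\, d\omega$, where $p(\omega)$ is a density function associated with $k(x - x')$. The integrand $e^{j \omega^T (x - x')}$ can be replaced with $\cos \omega^T (x - x')$ [Rahimi and Recht, 2008]. Then we can obtain a real-valued feature map $\phi_\omega(x) = [\cos(\omega^T x), \sin(\omega^T x)]^T$, where $\omega$ is randomly sampled according to the density function $p(\omega)$. We can obtain the feature map for $m$ random features of a real-valued kernel as follows.
$$\phi_\omega(x) = \sqrt{1/m}\,\big[\cos(\omega_1^T x), \ldots, \cos(\omega_m^T x), \sin(\omega_1^T x), \ldots, \sin(\omega_m^T x)\big]^T.$$
Obviously, $\phi_\omega(x)^T \phi_\omega(x')$ is an unbiased estimate of $k(x, x')$.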
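The approximation can be verified numerically for a Gaussian kernel, whose associated density is itself Gaussian. This is a minimal sketch assuming NumPy; the helper name `rff_map`, the bandwidth and the number of features are illustrative choices.

```python
import numpy as np

def rff_map(X, W):
    """Map rows of X to [cos(X w_1..m), sin(X w_1..m)] / sqrt(m)."""
    proj = X @ W                      # shape (n, m)
    m = W.shape[1]
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m, sigma = 5, 20000, 1.5
# For k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), omega ~ N(0, sigma^-2 I).
W = rng.normal(scale=1.0 / sigma, size=(d, m))
x = rng.normal(size=(1, d))
y = rng.normal(size=(1, d))

exact = float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2)))
approx = (rff_map(x, W) @ rff_map(y, W).T).item()
```

The inner product of the two feature vectors is an average of $\cos \omega_i^T (x - y)$ over the sampled frequencies, so its error shrinks at the usual Monte Carlo rate as `m` grows.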
4 Quadruply Stochastic Semi-Supervised AUC Maximization
4.1 Quadruply Stochastic Gradients
Based on the definition of the function , we easily obtain , and . Thus, the gradient of the objective (9) can be written as:
where denotes the derivative of w.r.t. , denotes the derivative of w.r.t. , denotes the derivative of w.r.t. , denotes the derivative of w.r.t. , denotes the derivative of w.r.t. and denotes the derivative of w.r.t. .
In order to update the classifier in a stochastic manner, we randomly sample a positive data point and a negative data point from the positive and negative datasets, respectively. In addition, we randomly sample an unlabeled data point according to the unlabeled data distribution density. In each iteration, we use a triplet of these data points to compute the stochastic functional gradient of (8) as follows.
We can apply the random Fourier feature method to further approximate the stochastic functional gradient as follows.
Obviously, we have that . Thus, we can achieve the unbiased estimate of the gradient (4.1) by using either or as follows,
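Because four sources of randomness (i.e., the positive instance, the negative instance, the unlabeled instance and the random features) are involved in this approximated functional gradient, we call it the quadruply stochastic functional gradient.
Then, we first give the update rule with the stochastic gradient as follows,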
where , denotes the step size and denotes the function value if we use gradient . Since is an unbiased estimate of , the update rule using after iterations can be written as follow,
where , and denotes the function value for the input if we use the functional gradient .
In order to implement the update process in a computer program, we rewrite the update rule as the following iterative update rules with constantly changing coefficients,
4.2 QSG-S2AUC Algorithms
In our implementation, we use pseudo-random number generators with a seed to sample random features. In each iteration, we only need to keep the seed aligned between prediction and training. Then the prediction function can be restored much more easily. Besides, QSG-S2AUC only maintains a sequence of coefficients at each iteration, which has a low memory requirement. Specifically, each iteration of the training algorithm executes the following steps.
Select Random Data Triplets: Randomly sample a positive instance, a negative instance and an unlabeled instance to compose a data triplet. In addition, we use mini-batches of these data points to achieve better efficiency.
Approximate the Kernel Function: Sample the random frequencies with a random seed to calculate the random features on the fly. We keep this seed aligned between prediction and training to speed up computing the function value.
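The steps above can be sketched as a short training loop. This is an illustrative sketch under several assumptions that are not taken from the text: a square pairwise loss $(1 - (u - v))^2$, the objective weighting $\gamma R_{PN} + (1 - \gamma)(R_{PU} + R_{NU})$ with a squared-norm regularizer, a simple decaying step size, single triplets rather than mini-batches, and the iteration index used as the random-feature seed. All names are hypothetical, and NumPy is assumed.

```python
import numpy as np

def rff(X, seed, n_feat, sigma):
    """Regenerate random Fourier features of a Gaussian kernel from a seed."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], n_feat))
    P = X @ W
    return np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(n_feat)

def predict(X, alphas, n_feat=32, sigma=2.0):
    """f(x) = sum_i <alpha_i, phi_i(x)>; iteration index i is the feature seed."""
    f = np.zeros(X.shape[0])
    for i, alpha in enumerate(alphas):
        f += rff(X, i, n_feat, sigma) @ alpha
    return f

def train(Xp, Xn, Xu, n_iter=150, n_feat=32, sigma=2.0,
          gamma=0.5, lam=1e-3, eta0=0.1, data_seed=12345):
    rng = np.random.default_rng(data_seed)
    alphas = []
    for t in range(n_iter):
        # Sample a (positive, negative, unlabeled) triplet.
        T = np.vstack([Xp[rng.integers(len(Xp))],
                       Xn[rng.integers(len(Xn))],
                       Xu[rng.integers(len(Xu))]])
        fp, fn, fu = predict(T, alphas, n_feat, sigma)
        # Assumed loss l(u, v) = (1 - (u - v))^2, so dl/du = -g and
        # dl/dv = +g with g = 2 * (1 - u + v).
        g_pn = 2.0 * (1.0 - fp + fn)   # positive vs. negative
        g_pu = 2.0 * (1.0 - fp + fu)   # positive vs. unlabeled (as negative)
        g_un = 2.0 * (1.0 - fu + fn)   # unlabeled (as positive) vs. negative
        c = np.array([-(gamma * g_pn + (1 - gamma) * g_pu),
                      gamma * g_pn + (1 - gamma) * g_un,
                      (1 - gamma) * (g_pu - g_un)])
        # Random features for this iteration, seeded by t.
        Phi = rff(T, t, n_feat, sigma)             # shape (3, 2 * n_feat)
        # Shrink old coefficients (regularizer) and append the new one.
        eta = eta0 / (1.0 + 0.01 * t)
        alphas = [(1.0 - eta * lam) * a for a in alphas]
        alphas.append(-eta * (c @ Phi))
    return alphas

# Toy usage: two well-separated Gaussian clusters, 20 labeled points per class.
rng = np.random.default_rng(1)
Xp = rng.normal(+2.0, 0.5, size=(60, 2))
Xn = rng.normal(-2.0, 0.5, size=(60, 2))
alphas = train(Xp[:20], Xn[:20], np.vstack([Xp[20:40], Xn[20:40]]))
sp, sn = predict(Xp[40:], alphas), predict(Xn[40:], alphas)
auc = float((sp[:, None] > sn[None, :]).mean())
```

Only the coefficient vectors are stored; the random frequencies are rebuilt from the iteration index at prediction time, which is the seed alignment described above. The cost is that each prediction replays every past iteration, so evaluating the model grows linearly with the number of iterations.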
5 Convergence Analysis
In this section, we prove that QSG-S2AUC converges to the optimal solution at the rate of $O(1/t)$. We first give several assumptions which are standard in DSG [Dai et al., 2014].
Assumption 1 (Bound of kernel function). The kernel function is bounded, i.e., $k(x, x') \le \kappa$, where $\kappa > 0$.
Assumption 2 (Bound of random feature norm). The random feature norms are bounded, i.e., $|\phi_\omega(x)\phi_\omega(x')| \le \phi$.
Assumption 3 (Lipschitz continuous). The first-order derivative of each of the three pairwise loss terms (PN, PU and NU) is Lipschitz continuous in terms of each of its two arguments, with a separate Lipschitz constant for each argument.
Assumption 4 (Bound of derivative). There exist six constants that bound the absolute values of the six first-order derivatives of the three pairwise loss terms.
We use the framework of [Dai et al., 2014] to prove that $f_{t+1}$ can converge to the optimal solution $f_*$. Specifically, we use the aforementioned function $h_{t+1}$, which is obtained with the stochastic gradient computed from exact kernel evaluations, as an intermediate value to decompose the difference between $f_{t+1}$ and $f_*$ as follows,
In other words, the total approximation error includes the error caused by approximating the kernel with random features and the error caused by sampling random data. Finally, the bound on the original error can be obtained by summing up the bounds on these two parts.
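A decomposition of this kind can be derived in a few lines. The sketch below follows the standard doubly stochastic gradient analysis and assumes the following notation, which may differ from the original: $f_{t+1}$ is the iterate built from random features, $h_{t+1}$ is the intermediate function built from exact kernel evaluations, $f_*$ is the optimal solution, and $\kappa$ bounds the kernel.

```latex
\begin{align*}
|f_{t+1}(x) - f_*(x)|^2
  &= \big|\,(f_{t+1}(x) - h_{t+1}(x)) + (h_{t+1}(x) - f_*(x))\,\big|^2 \\
  &\le 2\,|f_{t+1}(x) - h_{t+1}(x)|^2 + 2\,|h_{t+1}(x) - f_*(x)|^2
     && \text{since } (a+b)^2 \le 2a^2 + 2b^2 \\
  &= 2\,|f_{t+1}(x) - h_{t+1}(x)|^2
     + 2\,\big|\langle h_{t+1} - f_*,\, k(x,\cdot) \rangle_{\mathcal{H}}\big|^2
     && \text{reproducing property} \\
  &\le 2\,\underbrace{|f_{t+1}(x) - h_{t+1}(x)|^2}_{\text{error due to random features}}
     + 2\kappa\,\underbrace{\|h_{t+1} - f_*\|_{\mathcal{H}}^2}_{\text{error due to random data}}
     && \text{Cauchy-Schwarz, } k(x,x) \le \kappa
\end{align*}
```

The first term is controlled by a bound on the random-feature error and the second by a bound on the random-data error, which is why the two are analyzed in separate lemmas.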
Lemma 1 (Error due to random features)
Let denotes the whole training set in semi-supervised learning problem. For any , we have
where , and .
Obviously, the upper bound depends on the convergence of , which is given in Lemma 2.
Lemma 2
Suppose () and . We have and .
Lemma 3 (Error due to random data)
Set , , such that , we have
where , and .
Theorem 1 (Convergence in expectation)
Let denote the whole training set in semi-supervised learning problem. Set , , such that . , we have
Theorem 1 shows that for any given $x$, the evaluated value of $f_{t+1}$ at $x$ converges to that of $f_*$ in terms of the Euclidean distance at the rate of $O(1/t)$. This rate is the same as that of standard DSG even though our problem is much more complicated and has four sources of randomness.
6 Experiments
In this section, we present the experimental results on several datasets to demonstrate the effectiveness and efficiency of QSG-S2AUC.
6.1 Experimental Setup
We compare the AUC results and running time of QSG-S2AUC with the state-of-the-art semi-supervised AUC maximization algorithms summarized as follows.
- PNU-AUC: Unbiased semi-supervised AUC optimization method proposed in [Sakai et al., 2018] based on positive and unlabeled learning.
- SAMULT: The method proposed in [Xie and Li, 2018] which does not require the class prior distribution to achieve the unbiased solution.
All the experiments were run on a PC with 56 2.2GHz cores and 80GB RAM. We implemented QSG-S2AUC and SAMULT algorithms in MATLAB. We used the MATLAB code from https://github.com/t-sakai-kure/PNU as the implementation of PNU-AUC. For all algorithms, we use the square pairwise loss and the Gaussian kernel $k(x, x') = \exp(-\|x - x'\|^2 / (2\sigma^2))$. The hyper-parameters (, and ) are chosen via 5-fold cross-validation. and were searched in the region . The trade-off parameter in SAMULT and QSG-S2AUC was searched from to at intervals of , and that in PNU-AUC was searched from to at intervals of . In addition, the class prior in PNU-AUC is set to the class proportion in the whole training set, which can be estimated by [du Plessis et al., 2015]. All the results are the average of 10 trials.
6.2 Datasets
We carry out the experiments on eight large scale benchmark datasets collected from the LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/) and UCI (http://archive.ics.uci.edu/ml/datasets.html) repositories. The size of each dataset and the feature dimensionality are summarized in Table 2. To conduct the experiments for semi-supervised learning, we randomly sample 200 labeled instances and treat the rest of the data as unlabeled. All the data features are normalized in advance.
6.3 Results and Discussion
Figure 1 shows the training time of the three algorithms against different sizes of unlabeled samples on the eight benchmark datasets, where the number of labeled samples is fixed at 200. We can see that QSG-S2AUC is always faster than SAMULT and PNU-AUC. This is because SAMULT and PNU-AUC need $O(n^3)$ operations to compute the inverse of the kernel matrix. In contrast, QSG-S2AUC uses RFF to approximate the kernel function, and in each iteration it only needs to calculate the random features from the seed. In addition, the low memory requirement of QSG-S2AUC allows it to train efficiently on large scale datasets on which PNU-AUC and SAMULT run out of memory. Figure 2 presents the testing AUC results of these algorithms on the eight benchmark datasets. The results show that QSG-S2AUC achieves AUC results similar to those of the other methods on most datasets, and has the highest AUC on the Covtype and Ijcnn1 datasets. Based on these results, we conclude that QSG-S2AUC is superior to other state-of-the-art algorithms in terms of efficiency and scalability, while retaining similar generalization performance.
7 Conclusion
In this paper, we propose a novel scalable semi-supervised AUC optimization algorithm, QSG-S2AUC. Considering that semi-supervised learning contains three data sources, QSG-S2AUC is designed to randomly sample one instance from each data source in each iteration. Then, their random features are generated and used to calculate a quadruply stochastic functional gradient for the model update. Even though this optimization process contains multiple layers of stochastic sampling, we prove theoretically that QSG-S2AUC has a convergence rate of $O(1/t)$. The experimental results on various datasets also demonstrate the superiority of the proposed QSG-S2AUC.
Acknowledgments
H.H. was partially supported by U.S. NSF IIS 1836945, IIS 1836938, DBI 1836866, IIS 1845666, IIS 1852606, IIS 1838627, IIS 1837956. B.G. was partially supported by the National Natural Science Foundation of China (No: 61573191), and the Natural Science Foundation (No. BK20161534), Six talent peaks project (No. XYDXX-042) in Jiangsu Province.
References
- [Amini et al., 2008] Massih-Reza Amini, Tuong-Vinh Truong, and Cyril Goutte. A boosting algorithm for learning bipartite ranking functions with partially labeled data. 2008.
- [Dai et al., 2014] Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems, pages 3041–3049, 2014.
- [Ding et al., 2017] Yi Ding, Chenghao Liu, Peilin Zhao, and Steven CH Hoi. Large scale kernel methods for online AUC maximization. In 2017 IEEE International Conference on Data Mining (ICDM), pages 91–100. IEEE, 2017.
- [Drineas and Mahoney, 2005] Petros Drineas and Michael W Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6(Dec):2153–2175, 2005.
- [du Plessis et al., 2015] Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Class-prior estimation for learning from positive and unlabeled data. In ACML, pages 221–236, 2015.
- [Fine and Scheinberg, 2001] Shai Fine and Katya Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2(Dec):243–264, 2001.
- [Fujino and Ueda, 2016] Akinori Fujino and Naonori Ueda. A semi-supervised AUC optimization method with generative models. In Data Mining (ICDM), 2016 IEEE 16th International Conference on, pages 883–888. IEEE, 2016.
- [Gao and Zhou, 2015] Wei Gao and Zhi-Hua Zhou. On the consistency of AUC pairwise optimization. In IJCAI, pages 939–945, 2015.
- [Gao et al., 2013] Wei Gao, Rong Jin, Shenghuo Zhu, and Zhi-Hua Zhou. One-pass AUC optimization. In International Conference on Machine Learning, pages 906–914, 2013.
- [Geng et al., 2019] Xiang Geng, Bin Gu, Xiang Li, Wanli Shi, Guansheng Zheng, and Heng Huang. Scalable semi-supervised SVM via triply stochastic gradients. In 28th International Joint Conference on Artificial Intelligence, 2019.
- [Gu and Huo, 2018] Bin Gu and Zhouyuan Huo. Asynchronous doubly stochastic group regularized learning. In International Conference on Artificial Intelligence and Statistics (AISTATS 2018), 2018.
- [Gu et al., 2014] Bin Gu, Victor S Sheng, Keng Yeow Tay, Walter Romano, and Shuo Li. Incremental support vector learning for ordinal regression. IEEE Transactions on Neural Networks and Learning Systems, 26(7):1403–1416, 2014.
- [Gu et al., 2016] Bin Gu, Zhouyuan Huo, and Heng Huang. Asynchronous stochastic block coordinate descent with variance reduction. arXiv preprint arXiv:1610.09447, 2016.
- [Gu et al., 2018a] Bin Gu, Yingying Shan, Xiang Geng, and Guansheng Zheng. Accelerated asynchronous greedy coordinate descent algorithm for SVMs. In IJCAI, pages 2170–2176, 2018.
- [Gu et al., 2018b] Bin Gu, Miao Xin, Zhouyuan Huo, and Heng Huang. Asynchronous doubly stochastic sparse kernel learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- [Gu et al., 2018c] Bin Gu, Xiao-Tong Yuan, Songcan Chen, and Heng Huang. New incremental learning algorithm for semi-supervised support vector machine. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1475–1484. ACM, 2018.
- [Hanley and McNeil, 1982] James A Hanley and Barbara J McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29–36, 1982.
- [Huang et al., 2019] Feihu Huang, Bin Gu, Zhouyuan Huo, Songcan Chen, and Heng Huang. Faster gradient-free proximal stochastic methods for nonconvex nonsmooth optimization. arXiv preprint arXiv:1902.06158, 2019.
- [Li et al., 2017] Xiang Li, Bin Gu, Shuang Ao, Huaimin Wang, and Charles X Ling. Triply stochastic gradients on multiple kernel learning. UAI, 2017.
- [Liang, 2005] Percy Liang. Semi-supervised learning for natural language. PhD thesis, Massachusetts Institute of Technology, 2005.
- [Liu et al., 2018] Mingrui Liu, Xiaoxuan Zhang, Zaiyi Chen, Xiaoyu Wang, and Tianbao Yang. Fast stochastic AUC maximization with O(1/n)-convergence rate. In International Conference on Machine Learning, pages 3195–3203, 2018.
- [Rahimi and Recht, 2008] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.
- [Rudin, 2017] Walter Rudin. Fourier analysis on groups. Courier Dover Publications, 2017.
- [Sakai et al., 2017] Tomoya Sakai, Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Semi-supervised classification based on classification from positive and unlabeled data. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2998–3006. JMLR.org, 2017.
- [Sakai et al., 2018] Tomoya Sakai, Gang Niu, and Masashi Sugiyama. Semi-supervised AUC optimization based on positive-unlabeled learning. Machine Learning, 107(4):767–794, 2018.
- [Sholokhov et al., 2018] Alexey Sholokhov, Md Sahidullah, and Tomi Kinnunen. Semi-supervised speech activity detection with an application to automatic speaker verification. Computer Speech & Language, 47:132–156, 2018.
- [Smola and Schölkopf, 2000] Alex J Smola and Bernhard Schölkopf. Sparse greedy matrix approximation for machine learning. 2000.
- [Wang et al., 2010] Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Semi-supervised hashing for scalable image retrieval. 2010.
- [Xie and Li, 2018] Zheng Xie and Ming Li. Semi-supervised AUC optimization without guessing labels of unlabeled data. 2018.
- [Ying et al., 2016] Yiming Ying, Longyin Wen, and Siwei Lyu. Stochastic online AUC maximization. In Advances in Neural Information Processing Systems, pages 451–459, 2016.
- [Yu et al., 2019] Shuyang Yu, Bin Gu, Kunpeng Ning, Haiyan Chen, Jian Pei, and Heng Huang. Tackle balancing constraint for incremental semi-supervised support vector learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019.
Appendix A Detailed Proof of Convergence Rate
A.1 Proof of Lemma 1
Here we give the detailed proof of Lemma 1.
We denote . According to the assumption in section 5, have a bound:
Then we obtain Lemma 1. This completes the proof.
A.2 Proof of Lemma 2
Here we give the detailed proof of Lemma 2.
Obviously, . Then we have
When , we have , so and . When , if , then . If , then . So we get . Therefore, we obtain the lemma 2. This completes the proof.
A.3 Proof of Lemma 3
For convenience, we denote that , , , , and as , , , , and , respectively. In addition, we define the following three different gradient terms,
Note that from our previous definition, we have .
Denote . Then we have