Adaptive Subgradient Methods for Online AUC Maximization
Learning for maximizing AUC performance is an important research problem in Machine Learning and Artificial Intelligence. Unlike traditional batch learning methods for maximizing AUC, which often suffer from poor scalability, recent studies attempt to maximize AUC with single-pass online learning approaches. Despite the encouraging results reported, existing online AUC maximization algorithms often adopt simple online gradient descent approaches that fail to exploit the geometrical knowledge of the data observed during the online learning process, and thus could suffer from relatively larger regret. To address this limitation, we explore a novel algorithm, Adaptive Online AUC Maximization (AdaOAM), which employs an adaptive gradient method that exploits the knowledge of historical gradients to perform more informative online learning. The new adaptive updating strategy of AdaOAM is less sensitive to the parameter settings and maintains the same time complexity as previous non-adaptive counterparts. Additionally, we extend the algorithm to handle high-dimensional sparse data (SAdaOAM) and address sparsity in the solution by performing lazy gradient updating. We analyze the theoretical bounds and evaluate the empirical performance of both algorithms on various types of data sets. The encouraging empirical results highlight the effectiveness and efficiency of the proposed algorithms.
AUC (Area Under ROC curve) is an important measure for characterizing machine learning performance in many real-world applications, such as ranking and anomaly detection tasks, especially when misclassification costs are unknown. In general, AUC measures the probability that a randomly drawn positive instance has a higher decision value than a randomly sampled negative instance. Many efforts have been devoted recently to developing efficient AUC optimization algorithms for both batch and online learning tasks [2, 3, 4, 5, 6, 7].
Due to its high efficiency and scalability in real-world applications, online AUC optimization for streaming data has been actively studied in the research community in recent years. The key challenge for AUC optimization in the online setting is that AUC is a metric represented by the sum of pairwise losses between instances from different classes, which makes conventional online learning algorithms unsuitable for direct use in many real-world scenarios. To address this challenge, two core types of Online AUC Maximization (OAM) frameworks have been proposed recently. The first framework is based on the idea of buffer sampling [6, 8], which stores some randomly sampled historical examples in a buffer to represent the observed data for calculating the pairwise loss functions. The other framework focuses on one-pass AUC optimization, where the algorithm scans through the training data only once. The benefit of one-pass AUC optimization lies in the use of squared loss to represent the AUC loss function while providing proofs of its consistency with the AUC measure.
Although these algorithms have been shown to be capable of achieving fairly good AUC performance, they share a common trait of employing the online gradient descent technique, which fails to take advantage of the geometrical property of the data observed during the online learning process, while recent studies have shown the importance of exploiting this information for online optimization. To overcome this limitation of the existing works, we propose a novel framework of Adaptive Online AUC Maximization (AdaOAM), which uses the adaptive gradient optimization technique to exploit the geometric property of the observed data and accelerate online AUC maximization tasks. Specifically, the technique is motivated by a simple intuition: features that occur frequently in the online learning process should be assigned low learning rates, while rarely occurring features should be given high learning rates. To achieve this purpose, we propose the AdaOAM algorithm by adopting an adaptive gradient updating framework from prior work to control the learning rates for different features. We theoretically prove that the regret bound of the proposed algorithm is better than those of the existing non-adaptive algorithms. We also empirically compare the proposed algorithm with several state-of-the-art online AUC optimization algorithms on both benchmark datasets and real-world online anomaly detection datasets. The promising results validate the effectiveness and efficiency of the proposed AdaOAM.
To further handle high-dimensional sparse tasks in practice, we investigate an extension of the AdaOAM method, which is labeled here as the Sparse AdaOAM method (SAdaOAM). The motivation is that the regular AdaOAM algorithm assumes every feature is relevant, so most of the weights for the corresponding features are non-zero, which leads to redundancy and low efficiency when rare features are informative in high-dimensional tasks. To make AdaOAM more suitable for such cases, the SAdaOAM algorithm induces sparsity in the learned weights using adaptive proximal online gradient descent. To the best of our knowledge, this is the first effort to address the problem of keeping the online model sparse in the online AUC maximization task. Moreover, we theoretically analyze this algorithm and empirically evaluate it on an extensive set of real-world public datasets, in comparison with several state-of-the-art online AUC maximization algorithms. The promising results validate the effectiveness and efficacy of the proposed SAdaOAM.
The rest of this paper is organized as follows. We first review the related works from three core areas: online learning, AUC maximization, and sparse online learning, respectively. Then, we present the formulations of the proposed approaches for handling both regular and high-dimensional sparse data, and their theoretical analysis; we further show and discuss the comprehensive experimental results, the sensitivity of the parameters, and tradeoffs between the level of sparsity and AUC performances. Finally, we conclude the paper with a brief summary of the present work.
2 Related Work
Our work is closely related to three topics in the context of machine learning, namely, online learning, AUC maximization, and sparse online learning. Below we briefly review some of the important related work in these areas.
Online Learning. Online learning has been extensively studied in the machine learning community [11, 12, 13, 14, 15], mainly due to its high efficiency and scalability to large-scale learning tasks. Different from conventional batch learning methods that assume all training instances are available prior to the learning phase, online learning considers one instance at a time to update the model sequentially and iteratively. Therefore, online learning is well suited to tasks in which data arrives sequentially. A number of first-order algorithms have been proposed, including the well-known Perceptron algorithm and the Passive-Aggressive (PA) algorithm. Although PA introduces the concept of “maximum margin” for classification, it fails to control the direction and scale of parameter updates during the online learning phase. In order to address this issue, recent years have witnessed some second-order online learning algorithms [17, 18, 19, 20], which apply parameter confidence information to improve online learning performance. Further, in order to solve cost-sensitive classification tasks on-the-fly, online learning researchers have also proposed a few novel online learning algorithms to directly optimize some more meaningful cost-sensitive metrics [21, 22, 23].
AUC Maximization. AUC (Area Under ROC curve) is an important performance measure that has been widely used in classification with imbalanced data distributions. The ROC curve plots the true positive rate against the false positive rate across a range of thresholds. Thus, AUC represents the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. Recently, many algorithms have been developed to optimize AUC directly [2, 3, 4, 6, 7]. An early work presented a general framework for optimizing multivariate nonlinear performance measures such as AUC and F1 in a batch mode. Online learning algorithms for AUC maximization involving large-scale applications have also been studied. Among the online AUC maximization approaches, two core online AUC optimization frameworks have been proposed very recently. The first framework is based on the idea of buffer sampling [6, 8], which employs a fixed-size buffer to represent the observed data for calculating the pairwise loss functions. A representative study leveraged the reservoir sampling technique to represent the observed data instances by a fixed-size buffer, and reported notable theoretical and empirical results. A follow-up study examined the improved generalization capability of online learning algorithms for pairwise loss functions within the framework of buffer sampling. The main contribution of that work is the introduction of stream subsampling with replacement as the buffer update strategy. The other framework takes a different perspective: it extends the previous online AUC maximization framework with a regression-based one-pass learning mode, and achieves solid regret bounds by considering the square loss for the AUC optimization task due to its theoretical consistency with AUC.
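The buffer-sampling idea can be illustrated with classical reservoir sampling. The following is only a minimal sketch of the generic technique, not the buffer update rule of any cited OAM variant; the names `reservoir_offer` and `sample_stream` are hypothetical.

```python
import random


def reservoir_offer(buffer, item, n_seen, capacity, rng=random):
    """Offer the n_seen-th stream item (1-indexed) to a fixed-size buffer.

    After n_seen offers, every item seen so far is retained with
    probability capacity / n_seen (classical reservoir sampling).
    """
    if len(buffer) < capacity:
        buffer.append(item)
        return
    j = rng.randrange(n_seen)  # uniform integer in [0, n_seen)
    if j < capacity:
        buffer[j] = item  # replace a uniformly chosen slot


def sample_stream(stream, capacity, seed=0):
    """Run reservoir sampling over an iterable and return the final buffer."""
    rng = random.Random(seed)
    buffer = []
    for n_seen, item in enumerate(stream, start=1):
        reservoir_offer(buffer, item, n_seen, capacity, rng)
    return buffer
```

In a buffer-based OAM setting one would typically keep one such buffer per class, so that each incoming instance can be paired with stored instances of the opposite class.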
Sparse Online Learning. High dimensionality and high sparsity are two important issues for large-scale machine learning tasks. Many previous efforts have been devoted to tackling these issues in the batch setting, but they usually suffer from poor scalability when dealing with big data. Recent years have witnessed extensive research on sparse online learning [24, 25, 26, 27], which aims to learn sparse classifiers by limiting the number of active features. There are two core categories of methods for sparse online learning. The representative work of the first type follows the general framework of subgradient descent with truncation. Take the FOBOS algorithm as an example: it is based on the Forward-Backward Splitting method and solves the sparse online learning problem by alternating between two phases: (i) an unconstrained stochastic subgradient descent step with respect to the loss function, and (ii) an instantaneous optimization that trades off staying close to the result of the first step against minimizing the regularization term. Following this strategy, a later work argues that truncation at each step is too aggressive and thus proposes the Truncated Gradient (TG) method, which softens the updates by truncating the coefficients every K steps when they are lower than a predefined threshold. The second category of methods is mainly motivated by the dual averaging method. The most popular method in this category is Regularized Dual Averaging (RDA), which solves the optimization problem by using the running average of all past subgradients of the loss functions and the whole regularization term instead of its subgradient. In this manner, the RDA method has been shown to exploit the regularization structure more easily in the online phase and obtain the desired regularization effects more efficiently.
Despite the extensive work in these different fields of machine learning, to the best of our knowledge, our current work represents the first effort to explore adaptive gradient optimization and second-order learning techniques for online AUC maximization in both regular and sparse online learning settings.
3 Adaptive Subgradient Methods for OAM
3.1 Problem Setting
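We aim to learn a linear classification model that maximizes AUC for a binary classification problem. Without loss of generality, we assume the positive class to be smaller than the negative class. Denote $(x_t, y_t)$ as the training instance received at the $t$-th trial, where $x_t \in \mathbb{R}^d$ and $y_t \in \{-1, +1\}$, and $w_t \in \mathbb{R}^d$ is the weight vector learned so far.
Given this setting, let us define the AUC measurement for the binary classification task. Given a dataset $D = \{(x_i, y_i) \in \mathbb{R}^d \times \{-1, +1\} \mid i \in [n]\}$, where $n = n_+ + n_-$, we divide it into two sets naturally: the set of positive instances $D_+ = \{(x_i^+, +1) \mid i \in [n_+]\}$ and the set of negative instances $D_- = \{(x_j^-, -1) \mid j \in [n_-]\}$, where $n_+$ and $n_-$ are the numbers of positive and negative instances, respectively. For a linear classifier $w \in \mathbb{R}^d$, its AUC measurement on $D$ is defined as follows:
$$\mathrm{AUC}(w) = \frac{\sum_{i=1}^{n_+} \sum_{j=1}^{n_-} \left( \mathbb{I}(w \cdot x_i^+ > w \cdot x_j^-) + \frac{1}{2}\, \mathbb{I}(w \cdot x_i^+ = w \cdot x_j^-) \right)}{n_+ n_-},$$
where $\mathbb{I}(\pi)$ is the indicator function that outputs 1 if the prediction $\pi$ holds and 0 otherwise. We replace the indicator function with the following convex surrogate, i.e., the square loss, due to its consistency with AUC,
$$\ell(w, x_i^+ - x_j^-) = \left(1 - w \cdot (x_i^+ - x_j^-)\right)^2,$$
and find the optimal classifier by minimizing the following objective function
$$L(w) = \frac{\lambda}{2} \|w\|_2^2 + \frac{\sum_{i=1}^{n_+} \sum_{j=1}^{n_-} \ell(w, x_i^+ - x_j^-)}{2\, n_+ n_-}, \qquad (1)$$
where $\frac{\lambda}{2}\|w\|_2^2$ is introduced to regularize the complexity of the linear classifier. Note that the optimal $w^*$ satisfies $\|w^*\|_2 \le 1/\sqrt{\lambda}$ according to the strong duality theorem.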
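As a concrete illustration of the quantities in this subsection, the sketch below computes the empirical AUC of a linear scorer and the regularized pairwise square-loss surrogate. It is a brute-force reference under stated assumptions: ties are counted as 1/2, the pairwise term is normalized by twice the number of pairs, and the names `auc`, `square_loss_objective`, and `lam` are illustrative rather than taken from the paper.

```python
def score(w, x):
    """Linear decision value w . x for plain Python sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))


def auc(w, pos, neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    total = 0.0
    for xp in pos:
        for xn in neg:
            sp, sn = score(w, xp), score(w, xn)
            if sp > sn:
                total += 1.0
            elif sp == sn:
                total += 0.5
    return total / (len(pos) * len(neg))


def square_loss_objective(w, pos, neg, lam):
    """(lam / 2) * ||w||^2 plus the averaged pairwise square loss (halved)."""
    reg = 0.5 * lam * sum(wi * wi for wi in w)
    pair = 0.0
    for xp in pos:
        for xn in neg:
            margin = score(w, xp) - score(w, xn)
            pair += (1.0 - margin) ** 2
    return reg + pair / (2.0 * len(pos) * len(neg))
```

Both functions cost time proportional to the number of positive-negative pairs, which is exactly the scalability issue that motivates the online formulations discussed in this paper.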
3.2 Adaptive Online AUC Maximization
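Here, we introduce the proposed Adaptive Online AUC Maximization (AdaOAM) algorithm. Following a similar approach to the one-pass framework, we modify the loss function in (1) into a sum of losses for individual training instances, $\sum_{t=1}^{T} L_t(w)$, where
$$L_t(w) = \frac{\lambda}{2}\|w\|_2^2 + \frac{\sum_{i=1}^{t-1} \mathbb{I}[y_i \neq y_t] \left(1 - y_t (x_t - x_i)^\top w\right)^2}{2\, |\{ i \in [t-1] : y_i y_t = -1 \}|} \qquad (2)$$
for an i.i.d. sequence $\mathcal{S}_t = \{(x_i, y_i) \mid i \in [t]\}$, and it is an unbiased estimation of $L(w)$. $X_t^+$ and $X_t^-$ denote the sets of positive and negative instances of $\mathcal{S}_t$ respectively, and $T_t^+$ and $T_t^-$ are their respective cardinalities. Besides, $L_t(w)$ is set to $0$ for $T_t^+ T_t^- = 0$. If $y_t = 1$, the gradient of $L_t$ is
$$\nabla L_t(w) = \lambda w + x_t x_t^\top w - x_t + \frac{1}{T_t^-} \sum_{i : y_i = -1} \left( x_i + (x_i x_i^\top - x_i x_t^\top - x_t x_i^\top)\, w \right).$$
If we use $c_t^- = \frac{1}{T_t^-} \sum_{i : y_i = -1} x_i$ and $S_t^- = \frac{1}{T_t^-} \sum_{i : y_i = -1} (x_i - c_t^-)(x_i - c_t^-)^\top$ to refer to the mean and covariance matrix of the negative class, respectively, the gradient of $L_t$ can be simplified as
$$\nabla L_t(w) = \lambda w - x_t + c_t^- + (x_t - c_t^-)(x_t - c_t^-)^\top w + S_t^- w. \qquad (3)$$
Similarly, if $y_t = -1$,
$$\nabla L_t(w) = \lambda w + x_t - c_t^+ + (x_t - c_t^+)(x_t - c_t^+)^\top w + S_t^+ w, \qquad (4)$$
where $c_t^+$ and $S_t^+$ are the mean and covariance matrix of the positive class, respectively.
Upon obtaining the gradient $\nabla L_t(w_t)$, a simple solution is to move the weight $w_t$ in the opposite direction of $\nabla L_t(w_t)$, while keeping $\|w_{t+1}\|_2 \le 1/\sqrt{\lambda}$ via the projected gradient update
$$w_{t+1} = \Pi_{1/\sqrt{\lambda}}\left(w_t - \eta \nabla L_t(w_t)\right) = \arg\min_{\|w\|_2 \le 1/\sqrt{\lambda}} \left\| w - \left(w_t - \eta \nabla L_t(w_t)\right) \right\|_2^2,$$
due to $\|w^*\|_2 \le 1/\sqrt{\lambda}$.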
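The claim that the per-instance gradient depends on the opposite class only through its mean and covariance can be checked numerically. The sketch below assumes the pairwise square loss with labels in {-1, +1} and a biased (divide-by-count) covariance; `grad_direct`, `grad_stats`, and `project_l2_ball` are hypothetical helper names, and NumPy is assumed to be available.

```python
import numpy as np


def grad_direct(w, x_t, y_t, others, lam):
    """Gradient of lam/2 ||w||^2 + mean_i (1 - y_t (x_t - x_i)^T w)^2 / 2,
    where `others` holds the stored opposite-class instances as rows."""
    diffs = x_t - others                    # shape (m, d)
    resid = 1.0 - y_t * (diffs @ w)         # shape (m,)
    return lam * w - y_t * (resid[:, None] * diffs).mean(axis=0)


def grad_stats(w, x_t, y_t, c, S, lam):
    """Same gradient using only the opposite-class mean c and covariance S."""
    u = x_t - c
    return lam * w - y_t * u + u * (u @ w) + S @ w


def project_l2_ball(w, radius):
    """Euclidean projection onto {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)


# One plain projected gradient step (eta and lam are illustrative values).
rng = np.random.default_rng(0)
negatives = rng.normal(size=(7, 3))
x_t, w, lam, eta = rng.normal(size=3), np.zeros(3), 0.1, 0.05
c = negatives.mean(axis=0)
S = np.cov(negatives, rowvar=False, bias=True)
w_next = project_l2_ball(w - eta * grad_stats(w, x_t, 1, c, S, lam),
                         1.0 / np.sqrt(lam))
```

The statistics-based form never touches the stored instances, which is what allows a one-pass algorithm to discard the data after updating the mean and covariance.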
However, the above scheme is clearly insufficient, since it simply assigns the same learning rate to different features. In order to perform feature-wise gradient updating, we propose a second-order gradient optimization method, i.e., an Adaptive Gradient Updating strategy, inspired by the adaptive subgradient framework. Specifically, we denote $g_{1:t} = [g_1 \cdots g_t]$ as the matrix obtained by concatenating the gradient sequence. The $i$-th row of this matrix is $g_{1:t,i}$, which is the concatenation of the $i$-th component of each gradient. In addition, we define the outer product matrix $G_t = \sum_{\tau=1}^{t} g_\tau g_\tau^\top$. Using these notations, the generalization of the standard adaptive gradient descent leads to the following weight update
$$w_{t+1} = \Pi^{G_t^{1/2}}_{1/\sqrt{\lambda}}\left(w_t - \eta\, G_t^{-1/2} g_t\right),$$
where $\Pi^{A}_{1/\sqrt{\lambda}}(u) = \arg\min_{\|w\|_2 \le 1/\sqrt{\lambda}} \|w - u\|_A = \arg\min_{\|w\|_2 \le 1/\sqrt{\lambda}} \langle w - u, A (w - u) \rangle$ is the projection of a point $u$ onto $\{w : \|w\|_2 \le 1/\sqrt{\lambda}\}$ under the Mahalanobis norm.
However, an obvious drawback of the above update is the large computational effort needed for high-dimensional data, since it requires computing the root and inverse root of the outer product matrix $G_t$. In order to make the algorithm more efficient, we use the diagonal proxy of $G_t$ and thus the update becomes
$$w_{t+1} = \Pi^{\mathrm{diag}(G_t)^{1/2}}_{1/\sqrt{\lambda}}\left(w_t - \eta\, \mathrm{diag}(G_t)^{-1/2} g_t\right). \qquad (5)$$
In this way, both the root and inverse root of $\mathrm{diag}(G_t)$ can be computed in linear time. Furthermore, as we discuss later, when the gradient vectors are sparse, the update above can be conducted more efficiently in time proportional to the support of the gradient.
Another issue with the updating rule (5) is that $\mathrm{diag}(G_t)$ may not be invertible in all coordinates. To address this issue, we replace $\mathrm{diag}(G_t)^{1/2}$ with $H_t = \delta I + \mathrm{diag}(G_t)^{1/2}$, where $\delta > 0$ is a smoothing parameter. The parameter $\delta$ is introduced to make the diagonal matrix invertible and the algorithm robust, and it is usually set to a very small value so that it has little influence on the learning results. Given $H_t$, the feature-wise adaptive update can be computed as:
$$w_{t+1} = \Pi^{H_t}_{1/\sqrt{\lambda}}\left(w_t - \eta\, H_t^{-1} g_t\right). \qquad (6)$$
The intuition behind update rule (6) is natural: rarely occurring features are treated as more informative and discriminative than frequently occurring ones. Thus, these informative, rarely occurring features should be updated with higher learning rates by incorporating the geometrical property of the data observed in earlier stages. Besides, by using the previously observed gradients, the update process can mitigate the effects of noise and, intuitively, speed up convergence.
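The feature-wise behaviour described above can be seen in a few lines of code. This is a minimal sketch of a diagonal adaptive step in the style of AdaGrad, with the projection step deliberately omitted; `adagrad_step`, `sq_sum`, `eta`, and `delta` are illustrative names and values, not settings from the paper.

```python
import numpy as np


def adagrad_step(w, g, sq_sum, eta=0.1, delta=1e-8):
    """One diagonal adaptive-gradient step (projection omitted).

    sq_sum holds the running per-coordinate sum of squared gradients,
    i.e. the diagonal of the gradient outer-product matrix.
    """
    sq_sum = sq_sum + g * g
    h = delta + np.sqrt(sq_sum)   # diagonal of the smoothed scaling matrix
    return w - eta * g / h, sq_sum


# Feature 0 fires at every round, feature 1 only at the last round.
w, sq_sum = np.zeros(2), np.zeros(2)
for g in (np.array([1.0, 0.0]), np.array([1.0, 0.0])):
    w, sq_sum = adagrad_step(w, g, sq_sum)
w_before = w.copy()
w, sq_sum = adagrad_step(w, np.array([1.0, 1.0]), sq_sum)
last_step = np.abs(w - w_before)  # the rare feature receives the larger step
```

With equal gradient magnitudes in the last round, the rarely active coordinate moves further because its accumulated squared gradient is smaller.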
So far, we have reached the key framework of the basic update rule for model learning except the details on gradient calculation. From the gradient derivation equations (3) and (4), we need to maintain and update the mean vectors and covariance matrices of the incoming instance sequences observed. The mean vectors are easy to be computed and stored here, while the covariance matrices are a bit difficult to be updated due to the online setting. Therefore, we provide a simplified update scheme for covariance matrix computation to address this issue. For an incoming instance sequence , the covariance matrix is given by
Then, in our gradient update, if setting and , the covariance matrices are updated as follows:
It can be observed that the above updates fit the online setting well. Finally, Algorithm 1 summarizes the proposed AdaOAM method.
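An incremental mean and covariance update of this kind can be sketched as follows. The recursion below is derived from the identity covariance = second moment minus outer product of the mean, and is offered as an assumption-laden illustration rather than a transcription of the paper's update; `update_mean_cov` is a hypothetical name and the covariance is the biased (divide-by-count) estimate.

```python
import numpy as np


def update_mean_cov(c, S, x, n):
    """Fold instance x into a running mean c and biased covariance S.

    n is the number of instances of this class seen so far, including x.
    """
    c_new = c + (x - c) / n
    S_new = (S + np.outer(c, c) - np.outer(c_new, c_new)
             + (np.outer(x, x) - S - np.outer(c, c)) / n)
    return c_new, S_new


# Streaming over a small synthetic sample.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
c, S = np.zeros(4), np.zeros((4, 4))
for n, x in enumerate(X, start=1):
    c, S = update_mean_cov(c, S, x, n)
```

Each update touches a d-by-d matrix, so the cost per instance is quadratic in the dimension, which is the bottleneck that the sparse variant in the next subsection tries to avoid.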
3.3 Fast AdaOAM Algorithm for High-dimensional Sparse Data
A characteristic of the proposed AdaOAM algorithm described above is that it exploits the full features for weight learning, which may not be suitable or scalable for high-dimensional sparse data tasks. For example, in spam email detection tasks, the length of the vocabulary list can reach the million scale. Although the number of the features is large, many feature inputs are zero and do not provide any information to the detection task. The research work in  has shown that the classification performance saturates with dozens of features out of tens of thousands of features.
Taking the cue, in order to improve the efficiency and scalability of the AdaOAM algorithm on working with high-dimensional sparse data, we propose the Sparse AdaOAM algorithm (SAdaOAM) which learns a sparse linear classifier that contains a limited size of active features. In particular, SAdaOAM addresses the issue of sparsity in the learned model and maintains the efficacy of the original AdaOAM at the same time. To summarize, SAdaOAM has two benefits over the original AdaOAM algorithm: simple covariance matrix update and sparse model update. Next, we introduce these properties separately.
First, we employ a simpler covariance matrix update rule for high-dimensional sparse data than the one used in AdaOAM. The motivation for a different update scheme is that using the original covariance update rule of AdaOAM on high-dimensional data would lead to extremely high computational and storage costs, i.e., several matrix operations among multiple variables in the update formulations would be necessary. Therefore, we fall back to the standard definition of the covariance matrix and consider a simpler method for updates. Since the standard definition of the covariance matrix is , we just need to maintain the mean vector and the outer product of the instance at each iteration for the covariance update. In this case, we denote and . Then, the covariance matrices and can be formulated as
At each iteration, one only needs to update with and the mean vectors of the positive and negative instances, respectively, in the covariance matrices and . With the above update scheme, lower computational and storage costs are attained, since most of the elements in the covariance matrices are zero on high-dimensional sparse data.
Having presented the efficient scheme for updating the covariance matrices for high-dimensional sparse data, we next present the method for handling sparsity in the learned model and second-order adaptive gradient updates simultaneously. Here we impose the soft-constraint $\ell_1$-norm regularization on the objective function (2). So, the new objective function is
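The idea of keeping only a mean vector and an accumulated outer product can be sketched with dictionary-based sparse storage. This is an illustrative assumption about the data layout, not the paper's implementation; the class `SparseClassStats` and its methods are hypothetical, and instances are represented as `{feature_index: value}` dictionaries.

```python
from collections import defaultdict


class SparseClassStats:
    """Per-class running sums; only non-zero entries are ever stored."""

    def __init__(self):
        self.n = 0
        self.sum_x = defaultdict(float)    # sum of x_i
        self.sum_xx = defaultdict(float)   # sum of x_i * x_j, keyed by (i, j)

    def update(self, x):
        """Fold in one sparse instance; cost is quadratic in its non-zeros only."""
        self.n += 1
        for i, xi in x.items():
            self.sum_x[i] += xi
            for j, xj in x.items():
                self.sum_xx[(i, j)] += xi * xj

    def mean_entry(self, i):
        return self.sum_x.get(i, 0.0) / self.n

    def cov_entry(self, i, j):
        """Biased covariance entry: E[x_i x_j] - E[x_i] E[x_j]."""
        second = self.sum_xx.get((i, j), 0.0) / self.n
        return second - self.mean_entry(i) * self.mean_entry(j)


stats = SparseClassStats()
stats.update({0: 1.0, 2: 2.0})
stats.update({0: 3.0})
```

Covariance entries are formed on demand from the two running sums, so features that never co-occur cost neither time nor memory.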
In order to optimize this objective function, we apply the composite mirror descent method  that is able to achieve a trade-off between the immediate adaptive gradient term and the regularizer . We denote the -th diagonal element of the matrix as . Then, we give the derivation for the composite mirror descent gradient updates with regularization.
Following the framework of the composite mirror descent update in , the update needed to solve is
where and is the Bregman divergence associated with (see details in the proof of Theorem 1). After the expansion, this update amounts to
For easier derivation, we rearrange and rewrite the above objective function as
Let denote the optimal solution of the above optimization problem. Standard subgradient calculus indicates that when , the solution is . Similarly, when , then , the objective is differentiable, and the solution is achieved by setting the gradient to zero:
Similarly, when , then , and the solution is
Combining these three cases, we obtain the coordinate-wise update results for :
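The three cases above combine into the familiar soft-thresholding form. The sketch below assumes an $\ell_1$ penalty with weight `theta`, a step size `eta`, and a positive vector `h` holding the diagonal of the adaptive scaling matrix; these names are illustrative, and the code shows the generic adaptive proximal step rather than the exact update of Algorithm 2.

```python
import numpy as np


def sparse_adaptive_step(w, g, h, eta, theta):
    """Coordinate-wise adaptive gradient step followed by soft-thresholding.

    z_i = w_i - eta * g_i / h_i
    w_i <- sign(z_i) * max(|z_i| - eta * theta / h_i, 0)
    """
    z = w - eta * g / h
    return np.sign(z) * np.maximum(np.abs(z) - eta * theta / h, 0.0)


# With a zero gradient the step only shrinks weights toward zero.
w_new = sparse_adaptive_step(
    w=np.array([0.5, -0.5, 0.05]),
    g=np.zeros(3),
    h=np.ones(3),
    eta=1.0,
    theta=0.1,
)
```

Coordinates whose magnitude falls below the threshold are set exactly to zero, which is how the model stays sparse during online learning.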
The complete sparse online AUC maximization approach using the adaptive gradient updating algorithm (SAdaOAM) is summarized in Algorithm 2.
From Algorithm 2, it is observed that we perform “lazy” computation when the gradient vectors are sparse. Suppose that, from iteration step to , the -th component of the gradient is “0”. Then, we can evaluate the updates on demand since remains intact. Therefore, at iteration step when is needed, the update will be
where means . This type of “lazy” update is highly efficient.
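The lazy evaluation can be illustrated for a single coordinate. The sketch assumes that the gradient component is exactly zero for `k` consecutive steps, so the scaling entry `h_i` stays fixed and each eager step reduces to a pure shrinkage by `eta * theta / h_i`; the function names and parameters are hypothetical.

```python
def eager_shrink(w_i, h_i, eta, theta):
    """One soft-thresholding step for a coordinate with zero gradient."""
    mag = max(abs(w_i) - eta * theta / h_i, 0.0)
    return mag if w_i >= 0 else -mag


def lazy_shrink(w_i, h_i, eta, theta, k):
    """Apply k pending zero-gradient steps at once."""
    mag = max(abs(w_i) - k * eta * theta / h_i, 0.0)
    return mag if w_i >= 0 else -mag


def eager_k_times(w_i, h_i, eta, theta, k):
    """Reference: apply the eager step k times in a loop."""
    for _ in range(k):
        w_i = eager_shrink(w_i, h_i, eta, theta)
    return w_i
```

Because shrinkage never flips the sign of a weight and stops at zero, `k` eager steps and one lazy step agree, and a coordinate only needs to be touched when its feature is next active.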
4 Theoretical Analysis
In this section, we provide the regret bounds for the proposed set of AdaOAM algorithms for handling both regular and high-dimensional sparse data, respectively.
4.1 Regret Bounds with Regular Data
Firstly, we introduce two lemmas as follows, which will be used to facilitate our subsequent analyses.
Let , and be defined same in the Algorithm 1. Then
Let the sequence , be generated by the composite mirror descent update in Equation (12) and assume that . Using learning rate , for any optimal , the following bound holds
These two lemmas are actually the Lemma 4 and Corollary 1 in the paper .
Using these two lemmas, we can derive the following theorem for the proposed AdaOAM algorithm.
Assume and the diameter of is bounded via , we have
where , and .
We first define as
Based on the regularizer , it is easy to obtain due to the strong convexity property, and it is also reasonable to restrict with . Denote the projection of a point onto according to norm by , the AdaOAM actually employs the following update rule:
where and .
If we denote , and the dual norm of by , in which case , then it is easy to check the update rule 10 is the same with the following composite mirror descent method:
where the regularization function , and is the Bregman divergence associated with a strongly convex and differentiable function
Since we have in the case of regular data, the regret bound . Then, we follow the derivation results of  and attain the following regret bound
where is bounded via . Next, we would like to analyze the features’ dependency on the data of the gradient. Since
where is a constant to bound the scalar of the second term in the right side of the equation, and with , we have
Finally, combining the above inequalities, we arrive at
From the proof above, we can conclude that Algorithm 1 should have a lower regret than non-adaptive algorithms due to its dependence on the geometry of the underlying data space. If the features are normalized and sparse, the gradient terms in the bound should be much smaller than , which leads to lower regret and faster convergence. If the feature space is relatively dense, then the convergence rate will be for the general case as in OPAUC and OAM methods.
4.2 Regret Bounds with High-dimensional Sparse Data
Assume and the diameter of is bounded via , the regret bound with respect to regularization term is
where , , and .
In the case of high-dimensional sparse adaptive online AUC maximization, the regret we plan to bound with respect to the optimal weight is formulated as
where is the regularization term to impose sparsity to the solution. Similarly, if denote , and the dual norm of by , in which case , it is easy to check the updating rule of SAdaOAM
is the same with the following one
where . From , we have
Furthermore, we assume and set , the final regret bound is
This theoretical result shows that the regret bound for sparse solution is the same as that in the case when . ∎
As discussed above, the SAdaOAM algorithm should have lower regret bound than non-adaptive algorithms do on high-dimensional sparse data, though this depends on the geometric property of the underlying feature distribution. If some features appear much more frequently than others, then indicates that we could have remarkably lower regret by using higher learning rates for infrequent features and lower learning rates for often occurring features.
5 Experimental Results
In this section, we evaluate the proposed set of AdaOAM algorithms in terms of AUC performance and convergence rate, and examine their parameter sensitivity. The main framework of the experiments is based on LIBOL, an open-source library for online learning algorithms (http://libol.stevenhoi.org/).
5.1 Comparison Algorithms
We conduct comprehensive empirical studies by comparing the proposed algorithms with various AUC optimization algorithms for both online and batch scenarios. Specifically, the algorithms considered in our experiments include:
Online Uni-Exp: An online learning algorithm which optimizes the (weighted) univariate exponential loss;
Online Uni-Log: An online learning algorithm which optimizes the (weighted) univariate logistic loss;
OAM$_{seq}$: The OAM algorithm with reservoir sampling and sequential updating method;
OAM$_{gra}$: The OAM algorithm with reservoir sampling and online gradient updating method;
OPAUC: The one-pass AUC optimization algorithm with square loss function;
SVM-perf: A batch algorithm which directly optimizes AUC;
CAPO: A batch algorithm which trains nonlinear auxiliary classifiers first and then adapts auxiliary classifiers for specific performance measures;
Batch Uni-Log: A batch algorithm which optimizes the (weighted) univariate logistic loss;
Batch Uni-Squ: A batch algorithm which optimizes the (weighted) univariate square loss;
AdaOAM: The proposed adaptive gradient method for online AUC maximization;
SAdaOAM: The proposed sparse adaptive subgradient method for online AUC maximization.
Note that OAM$_{seq}$, OAM$_{gra}$, and OPAUC are the state-of-the-art methods for AUC maximization in online settings. For batch learning scenarios, CAPO and SVM-perf are both strong baselines to compare against.
5.2 Experimental Testbed and Setup
To examine the performance of the proposed AdaOAM in comparison to the existing state-of-the-art methods, we conduct extensive experiments on sixteen benchmark datasets, consistent with previous studies on online AUC maximization [6, 7]. Table I shows the details of the 16 binary-class datasets in our experiments. All of these datasets can be downloaded from LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/) and the UCI machine learning repository (http://www.ics.uci.edu/~mlearn/MLRepository.html). Note that several datasets (svmguide4, vehicle) are originally multi-class and were converted to class-imbalanced binary datasets for the purpose of our experimental studies.
In the experiments, the features have been normalized fairly, i.e., , which is reasonable since instances are received sequentially in the online learning setting. Each dataset has been randomly divided into 5 folds, in which 4 folds are for training and the remaining fold is for testing. We also generate 4 independent 5-fold partitions per dataset to further reduce the effects of random partition on the algorithms. Therefore, the reported AUC values are the average results of 20 runs for each dataset. 5-fold cross validation is conducted on the training sets to decide on the learning rate and the regularization parameter. For OAM$_{seq}$ and OAM$_{gra}$, the buffer size is fixed at 100 as suggested in prior work. All experiments for online setting comparisons were conducted with MATLAB on a computer workstation with 16GB memory and a 3.20GHz CPU. On the other hand, for fair comparisons in batch settings, the core steps of the algorithms were implemented in C++ since we directly use the toolboxes provided by the respective authors of the SVM-perf (http://www.cs.cornell.edu/people/tj/svm_light/svm_perf.html) and CAPO (http://lamda.nju.edu.cn/code_CAPO.ashx) algorithms.
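The evaluation protocol of 4 independent 5-fold partitions, giving 20 train/test runs per dataset, can be sketched as an index generator. This is a generic illustration, not the LIBOL or MATLAB code used in the experiments; `repeated_kfold` and its parameters are hypothetical names.

```python
import random


def repeated_kfold(n, k=5, repeats=4, seed=0):
    """Yield (train_indices, test_indices) for `repeats` independent k-fold splits."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[f::k] for f in range(k)]  # k nearly equal-sized folds
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield train, test


splits = list(repeated_kfold(23))
```

Averaging a metric over all yielded splits corresponds to the 20-run averages reported per dataset.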
5.3 Evaluation of AdaOAM on Benchmark Datasets
Table II summarizes the average AUC performance of the algorithms under study over the 16 datasets for the online setting. In this table, we use markers to indicate that AdaOAM is significantly better/worse than the corresponding method (pairwise $t$-tests at 95% significance level).
|datasets||AdaOAM||OPAUC||OAM$_{seq}$||OAM$_{gra}$||online Uni-Log||online Uni-Exp|
|glass||.816 ± .058||.804 ± .059|