Adaptive Subgradient Methods for Online AUC Maximization
Abstract
Learning to maximize AUC is an important research problem in machine learning and artificial intelligence. Unlike traditional batch learning methods for maximizing AUC, which often suffer from poor scalability, recent studies attempt to maximize AUC with single-pass online learning approaches. Despite their encouraging results, existing online AUC maximization algorithms typically adopt simple online gradient descent, which fails to exploit the geometrical knowledge of the data observed during the online learning process and can therefore suffer from relatively large regret. To address this limitation, we explore a novel algorithm, Adaptive Online AUC Maximization (AdaOAM), which employs an adaptive gradient method that exploits the knowledge of historical gradients to perform more informative online learning. The adaptive updating strategy of AdaOAM is less sensitive to parameter settings and maintains the same time complexity as previous non-adaptive counterparts. Additionally, we extend the algorithm to handle high-dimensional sparse data (SAdaOAM) and address sparsity in the solution by performing lazy gradient updating. We analyze the theoretical bounds of both algorithms and evaluate their empirical performance on various types of data sets. The encouraging empirical results clearly highlight the effectiveness and efficiency of the proposed algorithms.
1 Introduction
AUC (Area Under ROC curve) [1] is an important measure for characterizing machine learning performance in many real-world applications, such as ranking and anomaly detection tasks, especially when misclassification costs are unknown. In general, AUC measures the probability that a randomly drawn positive instance has a higher decision value than a randomly drawn negative instance. Many efforts have been devoted recently to developing efficient AUC optimization algorithms for both batch and online learning tasks [2, 3, 4, 5, 6, 7].
Due to its high efficiency and scalability in real-world applications, online AUC optimization for streaming data has been actively studied in the research community in recent years. The key challenge for AUC optimization in the online setting is that AUC is a metric represented by the sum of pairwise losses between instances from different classes, which makes conventional online learning algorithms unsuitable for direct use in many real-world scenarios. To address this challenge, two core types of Online AUC Maximization (OAM) frameworks have been proposed recently. The first framework is based on the idea of buffer sampling [6, 8], which stores some randomly sampled historical examples in a buffer to represent the observed data for calculating the pairwise loss functions. The other framework focuses on one-pass AUC optimization [7], where the algorithm scans through the training data only once. One-pass AUC optimization uses the square loss as a surrogate for the AUC loss, which has been proven consistent with the AUC measure [9].
Although these algorithms have been shown to achieve fairly good AUC performance, they share the common trait of employing the online gradient descent technique, which fails to take advantage of the geometrical property of the data observed during the online learning process, even though recent studies have shown the importance of exploiting this information for online optimization [10]. To overcome this limitation of existing work, we propose a novel framework, Adaptive Online AUC Maximization (AdaOAM), which uses the adaptive gradient optimization technique to exploit the geometric property of the observed data and accelerate online AUC maximization. Specifically, the technique is motivated by a simple intuition: features that occur frequently during online learning should be assigned low learning rates, while features that occur rarely should be given high learning rates. To this end, AdaOAM adopts the adaptive gradient updating framework proposed by [10] to control the learning rates of different features. We theoretically prove that the regret bound of the proposed algorithm is better than those of the existing non-adaptive algorithms. We also empirically compare the proposed algorithm with several state-of-the-art online AUC optimization algorithms on both benchmark datasets and real-world online anomaly detection datasets. The promising results validate the effectiveness and efficiency of the proposed AdaOAM.
To further handle high-dimensional sparse tasks in practice, we investigate an extension of the AdaOAM method, labeled here as the Sparse AdaOAM method (SAdaOAM). The motivation is that the regular AdaOAM algorithm assumes every feature is relevant, so most of the learned weights are nonzero, which leads to redundancy and low efficiency on high-dimensional tasks where only rare features are informative. To make AdaOAM more suitable for such cases, the SAdaOAM algorithm induces sparsity in the learned weights using adaptive proximal online gradient descent. To the best of our knowledge, this is the first effort to address the problem of keeping the online model sparse in the online AUC maximization task. Moreover, we theoretically analyze this algorithm and empirically evaluate it on an extensive set of real-world public datasets, in comparison with several state-of-the-art online AUC maximization algorithms. The promising results validate the effectiveness and efficacy of the proposed SAdaOAM.
The rest of this paper is organized as follows. We first review related work from three core areas: online learning, AUC maximization, and sparse online learning. Then, we present the formulations of the proposed approaches for handling both regular and high-dimensional sparse data, together with their theoretical analysis. We further show and discuss the comprehensive experimental results, the sensitivity of the parameters, and the trade-offs between the level of sparsity and AUC performance. Finally, we conclude the paper with a brief summary of the present work.
2 Related Work
Our work is closely related to three topics in the context of machine learning, namely, online learning, AUC maximization, and sparse online learning. Below we briefly review some of the important related work in these areas.
Online Learning. Online learning has been extensively studied in the machine learning community [11, 12, 13, 14, 15], mainly due to its high efficiency and scalability to large-scale learning tasks. Unlike conventional batch learning methods, which assume that all training instances are available prior to the learning phase, online learning considers one instance at a time and updates the model sequentially and iteratively. Online learning is therefore well suited to tasks in which data arrives sequentially. A number of first-order algorithms have been proposed, including the well-known Perceptron algorithm [16] and the Passive-Aggressive (PA) algorithm [12]. Although PA introduces the concept of "maximum margin" for classification, it fails to control the direction and scale of parameter updates during the online learning phase. To address this issue, recent years have witnessed some second-order online learning algorithms [17, 18, 19, 20], which apply parameter confidence information to improve online learning performance. Further, to solve cost-sensitive classification tasks on-the-fly, online learning researchers have also proposed a few novel online learning algorithms that directly optimize more meaningful cost-sensitive metrics [21, 22, 23].
AUC Maximization. AUC (Area Under ROC curve) is an important performance measure that has been widely used for classification under imbalanced data distributions. The ROC curve plots the true positive rate against the false positive rate over a range of decision thresholds. Thus, AUC represents the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. Recently, many algorithms have been developed to optimize AUC directly [2, 3, 4, 6, 7]. In [4], the author first presented a general framework for optimizing multivariate nonlinear performance measures such as AUC and F1 in a batch mode. Online learning algorithms for AUC maximization in large-scale applications have also been studied. Among the online AUC maximization approaches, two core frameworks have been proposed very recently. The first framework is based on the idea of buffer sampling [6, 8], which employs a fixed-size buffer to represent the observed data for calculating the pairwise loss functions. A representative study is [6], which leveraged the reservoir sampling technique to represent the observed data instances by a fixed-size buffer and reported notable theoretical and empirical results. Then, [8] studied the improved generalization capability of online learning algorithms for pairwise loss functions within the buffer sampling framework. The main contribution of their work is the introduction of stream subsampling with replacement as the buffer update strategy. The other framework, which takes a different perspective, was presented by [7]. They extended the previous online AUC maximization framework with a regression-based one-pass learning mode, and achieved solid regret bounds by considering the square loss for the AUC optimization task due to its theoretical consistency with AUC.
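To make the buffer sampling idea concrete, the following is a minimal sketch of classic reservoir sampling, which keeps a uniform random sample of fixed size from a stream of unknown length. It is an illustration only and not necessarily the exact buffer update used in [6]; the function name `reservoir_update` and its parameters are hypothetical.

```python
import random


def reservoir_update(buffer, item, seen, capacity, rng):
    """Offer one stream item to a fixed-size reservoir (classic Algorithm R).

    seen is the number of items observed so far, including this one.
    This is an illustrative sketch, not the exact variant from [6].
    """
    if len(buffer) < capacity:
        # The buffer is not full yet, so every item is kept.
        buffer.append(item)
    else:
        # Keep the new item with probability capacity / seen by drawing
        # a uniform slot index in [0, seen) and replacing only on a hit.
        j = rng.randrange(seen)
        if j < capacity:
            buffer[j] = item


rng = random.Random(0)
buffer = []
for t, item in enumerate(range(1000), start=1):
    reservoir_update(buffer, item, t, capacity=100, rng=rng)
```

In a buffer-based OAM method, one such buffer would typically be kept per class, so that each incoming instance can be paired with stored instances of the opposite class.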
Sparse Online Learning. High dimensionality and high sparsity are two important issues for large-scale machine learning tasks. Many previous efforts have been devoted to tackling these issues in the batch setting, but they usually suffer from poor scalability when dealing with big data. Recent years have witnessed extensive research on sparse online learning [24, 25, 26, 27], which aims to learn sparse classifiers by limiting the number of active features. There are two core categories of methods for sparse online learning. The first category follows the general framework of subgradient descent with truncation. One example is the FOBOS algorithm [25], which is based on the Forward-Backward Splitting method and solves the sparse online learning problem by alternating between two phases: (i) an unconstrained stochastic subgradient descent step with respect to the loss function, and (ii) an instantaneous optimization that trades off staying close to the result of the first step against minimizing the regularization term. Following this strategy, [24] argues that truncation at every step is too aggressive and thus proposes the Truncated Gradient (TG) method, which softens the updates by truncating the coefficients only every K steps, when they are lower than a predefined threshold. The second category of methods is mainly motivated by the dual averaging method [28]. The most popular method in this category is Regularized Dual Averaging (RDA) [26], which solves the optimization problem by using the running average of all past subgradients of the loss functions and the whole regularization term instead of its subgradient. In this manner, the RDA method has been shown to exploit the regularization structure more easily in the online phase and to obtain the desired regularization effects more efficiently.
Despite the extensive work in these different fields of machine learning, to the best of our knowledge, our current work represents the first effort to explore adaptive gradient optimization and second-order learning techniques for online AUC maximization in both regular and sparse online learning settings.
3 Adaptive Subgradient Methods for OAM
3.1 Problem Setting
We aim to learn a linear classification model that maximizes AUC for a binary classification problem. Without loss of generality, we assume that the positive class has fewer instances than the negative class. Denote $(\mathbf{x}_t, y_t)$ as the training instance received at the $t$-th trial, where $\mathbf{x}_t \in \mathbb{R}^d$ and $y_t \in \{-1, +1\}$, and $\mathbf{w}_t \in \mathbb{R}^d$ is the weight vector learned so far.
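Given this setting, let us define the AUC measurement [1] for the binary classification task. Given a dataset $D = \{(\mathbf{x}_i, y_i) \in \mathbb{R}^d \times \{-1, +1\} \mid i \in [n]\}$, where $n = n_+ + n_-$, we divide it naturally into two sets: the set of positive instances $D_+ = \{(\mathbf{x}_i^+, +1) \mid i \in [n_+]\}$ and the set of negative instances $D_- = \{(\mathbf{x}_j^-, -1) \mid j \in [n_-]\}$, where $n_+$ and $n_-$ are the numbers of positive and negative instances, respectively. For a linear classifier $\mathbf{w} \in \mathbb{R}^d$, its AUC measurement on $D$ is defined as follows:
$$\mathrm{AUC}(\mathbf{w}) = \frac{\sum_{i=1}^{n_+} \sum_{j=1}^{n_-} \left( \mathbb{I}(\mathbf{w} \cdot \mathbf{x}_i^+ > \mathbf{w} \cdot \mathbf{x}_j^-) + \frac{1}{2} \mathbb{I}(\mathbf{w} \cdot \mathbf{x}_i^+ = \mathbf{w} \cdot \mathbf{x}_j^-) \right)}{n_+ n_-},$$
where $\mathbb{I}(\pi)$ is the indicator function that outputs 1 if the prediction $\pi$ holds and 0 otherwise. We replace the indicator function with the following convex surrogate, i.e., the square loss from [7], due to its consistency with AUC [9]:
$$\ell(\mathbf{w}, \mathbf{x}_i^+ - \mathbf{x}_j^-) = \left(1 - \mathbf{w} \cdot (\mathbf{x}_i^+ - \mathbf{x}_j^-)\right)^2,$$
and find the optimal classifier by minimizing the following objective function:
$$\mathcal{L}(\mathbf{w}) = \frac{\lambda}{2} \|\mathbf{w}\|_2^2 + \frac{\sum_{i=1}^{n_+} \sum_{j=1}^{n_-} \ell(\mathbf{w}, \mathbf{x}_i^+ - \mathbf{x}_j^-)}{2 n_+ n_-} \qquad (1)$$
where $\frac{\lambda}{2} \|\mathbf{w}\|_2^2$ is introduced to regularize the complexity of the linear classifier. Note that the optimal $\mathbf{w}^*$ satisfies $\|\mathbf{w}^*\|_2 \le 1/\sqrt{\lambda}$ according to the strong duality theorem.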
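The pairwise definitions in this section can be checked numerically with a small sketch. The code below computes the empirical AUC of a linear scorer by counting correctly ranked positive-negative pairs (ties count one half) and evaluates the regularized pairwise square-loss objective. The function names, the toy data, and the parameter name `lam` for the regularization weight are illustrative assumptions.

```python
def score(w, x):
    """Linear decision value w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def auc(w, positives, negatives):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    total = 0.0
    for xp in positives:
        for xn in negatives:
            sp, sn = score(w, xp), score(w, xn)
            if sp > sn:
                total += 1.0
            elif sp == sn:
                total += 0.5
    return total / (len(positives) * len(negatives))


def square_loss_objective(w, positives, negatives, lam):
    """Regularized pairwise square-loss surrogate (lam is an assumed name)."""
    pair_loss = 0.0
    for xp in positives:
        for xn in negatives:
            pair_loss += (1.0 - (score(w, xp) - score(w, xn))) ** 2
    reg = 0.5 * lam * sum(wi * wi for wi in w)
    return reg + pair_loss / (2.0 * len(positives) * len(negatives))


# Toy data, made up for illustration only.
positives = [[1.0, 0.0], [0.8, 0.3]]
negatives = [[0.0, 1.0], [0.9, 0.9], [0.2, 0.1]]

perfect = auc([1.0, -1.0], positives, negatives)  # every pair ranked correctly
chance = auc([0.0, 0.0], positives, negatives)    # every pair tied
```

Note that at the zero vector every pair incurs a loss of 1, so the surrogate objective equals 1/2 there; this is the value that bounds the norm of the minimizer.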
3.2 Adaptive Online AUC Maximization
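Here, we introduce the proposed Adaptive Online AUC Maximization (AdaOAM) algorithm. Following an approach similar to [7], we rewrite the loss function $\mathcal{L}(\mathbf{w})$ in (1) as a sum of losses for individual training instances, $\sum_{t} \mathcal{L}_t(\mathbf{w})$, where
$$\mathcal{L}_t(\mathbf{w}) = \frac{\lambda}{2} \|\mathbf{w}\|_2^2 + \frac{\sum_{i=1}^{t-1} \mathbb{I}[y_i \ne y_t] \left(1 - y_t (\mathbf{x}_t - \mathbf{x}_i)^\top \mathbf{w}\right)^2}{2 \left|\{ i \in [t-1] : y_i y_t = -1 \}\right|} \qquad (2)$$
for an i.i.d. sequence $\mathcal{S}_t = \{(\mathbf{x}_i, y_i) \mid i \in [t]\}$, and it is an unbiased estimate of $\mathcal{L}(\mathbf{w})$. $X_t^+$ and $X_t^-$ denote the sets of positive and negative instances of $\mathcal{S}_t$, respectively, and $T_t^+$ and $T_t^-$ are their respective cardinalities. Besides, $\mathcal{L}_t(\mathbf{w})$ is set to $0$ for $T_t^+ T_t^- = 0$. If $y_t = 1$, the gradient of $\mathcal{L}_t$ is
$$\nabla \mathcal{L}_t(\mathbf{w}) = \lambda \mathbf{w} + \mathbf{x}_t \mathbf{x}_t^\top \mathbf{w} - \mathbf{x}_t + \sum_{i : y_i = -1} \frac{\mathbf{x}_i + (\mathbf{x}_i \mathbf{x}_i^\top - \mathbf{x}_i \mathbf{x}_t^\top - \mathbf{x}_t \mathbf{x}_i^\top) \mathbf{w}}{T_t^-}.$$
If we use $\mathbf{c}_t^-$ and $S_t^-$ to refer to the mean and covariance matrix of the negative class, respectively, the gradient of $\mathcal{L}_t$ can be simplified as
$$\nabla \mathcal{L}_t(\mathbf{w}) = \lambda \mathbf{w} - \mathbf{x}_t + \mathbf{c}_t^- + (\mathbf{x}_t - \mathbf{c}_t^-)(\mathbf{x}_t - \mathbf{c}_t^-)^\top \mathbf{w} + S_t^- \mathbf{w}. \qquad (3)$$
Similarly, if $y_t = -1$,
$$\nabla \mathcal{L}_t(\mathbf{w}) = \lambda \mathbf{w} + \mathbf{x}_t - \mathbf{c}_t^+ + (\mathbf{x}_t - \mathbf{c}_t^+)(\mathbf{x}_t - \mathbf{c}_t^+)^\top \mathbf{w} + S_t^+ \mathbf{w}, \qquad (4)$$
where $\mathbf{c}_t^+$ and $S_t^+$ are the mean and covariance matrix of the positive class, respectively.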
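The simplification that leads to (3) can be verified numerically. The sketch below, written for a positive incoming instance, computes the gradient once by summing over all stored negative instances and once from the negative-class mean and (population, divide-by-count) covariance only, and the two agree. All function names and the parameter name `lam` are illustrative assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_and_cov(points):
    """Mean vector and population covariance (divide by n) of a list of vectors."""
    n, d = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(d)]
    cov = [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in points) / n
            for b in range(d)] for a in range(d)]
    return mean, cov


def grad_direct(w, x_t, negatives, lam):
    """Gradient of the instance loss for a positive x_t, using every negative."""
    g = [lam * wi for wi in w]
    for x in negatives:
        diff = [a - b for a, b in zip(x_t, x)]
        # Derivative of (1 - w . diff)^2 / (2 * count) with respect to w.
        coef = -(1.0 - dot(w, diff)) / len(negatives)
        for k in range(len(w)):
            g[k] += coef * diff[k]
    return g


def grad_from_stats(w, x_t, mean, cov, lam):
    """Same gradient computed from the negative-class mean and covariance."""
    diff = [a - b for a, b in zip(x_t, mean)]
    proj = dot(diff, w)
    cov_w = [dot(row, w) for row in cov]
    return [lam * w[k] - diff[k] + proj * diff[k] + cov_w[k]
            for k in range(len(w))]


negatives = [[0.0, 1.0], [0.5, 0.5], [1.0, 2.0]]
x_t, w, lam = [1.0, 0.2], [0.3, -0.4], 0.1
mean, cov = mean_and_cov(negatives)
g_direct = grad_direct(w, x_t, negatives, lam)
g_stats = grad_from_stats(w, x_t, mean, cov, lam)
```

The practical point is that the second form needs only first- and second-order statistics of the opposite class, so no past instances have to be stored.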
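Upon obtaining the gradient $\mathbf{g}_t = \nabla \mathcal{L}_t(\mathbf{w}_t)$, a simple solution is to move the weight $\mathbf{w}_t$ in the opposite direction of $\mathbf{g}_t$, while keeping $\|\mathbf{w}_{t+1}\|_2 \le 1/\sqrt{\lambda}$ via the projected gradient update [29]
$$\mathbf{w}_{t+1} = \Pi_{\frac{1}{\sqrt{\lambda}}}(\mathbf{w}_t - \eta \mathbf{g}_t) = \arg\min_{\|\mathbf{w}\|_2 \le 1/\sqrt{\lambda}} \|\mathbf{w} - (\mathbf{w}_t - \eta \mathbf{g}_t)\|_2^2,$$
due to $\|\mathbf{w}^*\|_2 \le 1/\sqrt{\lambda}$.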
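As a minimal sketch of the non-adaptive baseline described above, the code below takes one gradient step with a single learning rate for all features and then projects the result back onto a Euclidean ball. It assumes the feasible set is the ball of radius 1/sqrt(lambda) implied by the regularizer; the function names and the parameter names `eta` and `lam` are hypothetical.

```python
import math


def project_to_ball(w, radius):
    """Euclidean projection of w onto the ball of the given radius."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm <= radius:
        return list(w)
    return [wi * radius / norm for wi in w]


def projected_gradient_step(w, g, eta, lam):
    """One online gradient descent step followed by projection.

    Assumes the feasible set is {w : ||w||_2 <= 1 / sqrt(lam)}.
    """
    stepped = [wi - eta * gi for wi, gi in zip(w, g)]
    return project_to_ball(stepped, 1.0 / math.sqrt(lam))


# With lam = 4 the radius is 0.5, so the raw step (3, 4) is scaled back.
w_next = projected_gradient_step([0.0, 0.0], [-3.0, -4.0], eta=1.0, lam=4.0)
```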
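However, the above scheme is clearly insufficient, since it assigns the same learning rate to every feature. In order to perform feature-wise gradient updating, we propose a second-order gradient optimization method, i.e., an Adaptive Gradient Updating strategy, inspired by [10]. Specifically, we denote $\mathbf{g}_{1:t} = [\mathbf{g}_1 \; \ldots \; \mathbf{g}_t]$ as the matrix obtained by concatenating the gradient sequence. The $i$-th row of this matrix is $\mathbf{g}_{1:t,i}$, which is the concatenation of the $i$-th component of each gradient. In addition, we define the outer product matrix $G_t = \sum_{\tau=1}^{t} \mathbf{g}_\tau \mathbf{g}_\tau^\top$. Using these notations, the generalization of standard adaptive gradient descent leads to the following weight update
$$\mathbf{w}_{t+1} = \Pi_{\frac{1}{\sqrt{\lambda}}}^{G_t^{1/2}} \left( \mathbf{w}_t - \eta G_t^{-1/2} \mathbf{g}_t \right),$$
where $\Pi_{\frac{1}{\sqrt{\lambda}}}^{A}(\mathbf{u}) = \arg\min_{\|\mathbf{w}\|_2 \le 1/\sqrt{\lambda}} \|\mathbf{w} - \mathbf{u}\|_A = \arg\min_{\|\mathbf{w}\|_2 \le 1/\sqrt{\lambda}} \langle \mathbf{w} - \mathbf{u}, A(\mathbf{w} - \mathbf{u}) \rangle$ denotes the projection of a point $\mathbf{u}$ onto $\{\mathbf{w} : \|\mathbf{w}\|_2 \le 1/\sqrt{\lambda}\}$ under the Mahalanobis norm $\|\cdot\|_A$.
However, an obvious drawback of the above update is the large computational effort needed for high-dimensional data, since it requires computing the root and inverse root of the outer product matrix $G_t$. In order to make the algorithm more efficient, we use the diagonal proxy of $G_t$, and thus the update becomes
$$\mathbf{w}_{t+1} = \Pi_{\frac{1}{\sqrt{\lambda}}}^{\mathrm{diag}(G_t)^{1/2}} \left( \mathbf{w}_t - \eta \, \mathrm{diag}(G_t)^{-1/2} \mathbf{g}_t \right). \qquad (5)$$
In this way, both the root and inverse root of $\mathrm{diag}(G_t)$ can be computed in linear time. Furthermore, as we discuss later, when the gradient vectors are sparse, the update above can be conducted more efficiently, in time proportional to the support of the gradient.
Another issue with the updating rule (5) is that $\mathrm{diag}(G_t)$ may not be invertible in all coordinates. To address this issue, we replace $\mathrm{diag}(G_t)^{1/2}$ with $H_t = \delta I + \mathrm{diag}(G_t)^{1/2}$, where $\delta > 0$ is a smoothing parameter. The parameter $\delta$ is introduced to make the diagonal matrix invertible and the algorithm robust; it is usually set to a very small value so that it has little influence on the learning results. Given $H_t$, the feature-wise adaptive update can be computed as:
$$\mathbf{w}_{t+1} = \Pi_{\frac{1}{\sqrt{\lambda}}}^{H_t} \left( \mathbf{w}_t - \eta H_t^{-1} \mathbf{g}_t \right). \qquad (6)$$
Before projection, the $i$-th coordinate is therefore moved by $\eta \, g_{t,i} / (\delta + \|\mathbf{g}_{1:t,i}\|_2)$.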
The intuition behind the update rule (6) is natural: rarely occurring features are considered more informative and discriminative than frequently occurring ones. Thus, these informative, rarely occurring features should be updated with higher learning rates by incorporating the geometrical property of the data observed in earlier stages. Besides, by using the previously observed gradients, the update process can intuitively mitigate the effects of noise and speed up convergence.
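The per-feature learning rates can be illustrated with a short sketch of a diagonal adaptive step. Each coordinate is scaled by the inverse root of its accumulated squared gradients plus a small smoothing constant. The projection step is deliberately omitted here, so this is a simplified illustration rather than the full update; the class name and the parameter names `eta` and `delta` are assumptions.

```python
import math


class DiagonalAdaptiveStep:
    """Per-coordinate adaptive step sizes (projection omitted for brevity)."""

    def __init__(self, dim, eta=0.1, delta=1e-8):
        self.eta = eta          # base learning rate (assumed name)
        self.delta = delta      # small smoothing constant keeping h_ii > 0
        self.sum_sq = [0.0] * dim  # running sum of squared gradients

    def update(self, w, g):
        new_w = []
        for i, (wi, gi) in enumerate(zip(w, g)):
            self.sum_sq[i] += gi * gi
            h_ii = self.delta + math.sqrt(self.sum_sq[i])
            new_w.append(wi - self.eta * gi / h_ii)
        return new_w


stepper = DiagonalAdaptiveStep(dim=2, eta=1.0)
w = [0.0, 0.0]
# Feature 0 receives a gradient every round, feature 1 only in the last one.
for _ in range(3):
    w = stepper.update(w, [1.0, 0.0])
before = w
after = stepper.update(w, [1.0, 1.0])
frequent_move = before[0] - after[0]  # about 1 / sqrt(4)
rare_move = before[1] - after[1]      # about 1 / sqrt(1)
```

In this toy run the rarely active feature moves about twice as far as the frequently active one for the same gradient magnitude, which is the behavior the intuition above describes.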
So far, we have described the basic update rule for model learning, except for the details of the gradient calculation. According to the gradient expressions (3) and (4), we need to maintain and update the mean vectors and covariance matrices of the observed instance sequences. The mean vectors are easy to compute and store, while the covariance matrices are harder to update in the online setting. Therefore, we provide a simplified update scheme for the covariance matrix computation. For an incoming instance sequence , the covariance matrix is given by
Then, in our gradient update, if setting and , the covariance matrices are updated as follows:
It can be observed that the above updates fit the online setting well. Finally, Algorithm 1 summarizes the proposed AdaOAM method.
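One way such an online covariance update can be realized is sketched below. The running second moment is updated incrementally and the outer product of the current mean is subtracted, so no past instance is stored. This is a generic incremental scheme under the population (divide-by-count) convention and is not claimed to match the paper's exact formulas; the names `RunningMoments` and `batch_moments` are hypothetical.

```python
class RunningMoments:
    """Running mean and population covariance of a stream of vectors."""

    def __init__(self, dim):
        self.n = 0
        self.mean = [0.0] * dim
        self.cov = [[0.0] * dim for _ in range(dim)]

    def update(self, x):
        d = len(x)
        self.n += 1
        old_mean = self.mean
        new_mean = [m + (xi - m) / self.n for m, xi in zip(old_mean, x)]
        for a in range(d):
            for b in range(d):
                # Recover the old second moment E[x_a x_b], refresh it with
                # the new instance, then subtract the new mean product.
                second_old = self.cov[a][b] + old_mean[a] * old_mean[b]
                second_new = second_old + (x[a] * x[b] - second_old) / self.n
                self.cov[a][b] = second_new - new_mean[a] * new_mean[b]
        self.mean = new_mean


def batch_moments(points):
    """Reference computation over the full list, used only for checking."""
    n, d = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(d)]
    cov = [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in points) / n
            for b in range(d)] for a in range(d)]
    return mean, cov


points = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [2.0, 2.0]]
running = RunningMoments(dim=2)
for p in points:
    running.update(p)
ref_mean, ref_cov = batch_moments(points)
```

In AdaOAM one such accumulator would be kept per class, and only the accumulator of the incoming instance's class is updated at each trial.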
3.3 Fast AdaOAM Algorithm for High-dimensional Sparse Data
A characteristic of the AdaOAM algorithm described above is that it uses all features for weight learning, which may not be suitable or scalable for high-dimensional sparse data tasks. For example, in spam email detection tasks, the vocabulary can contain millions of terms. Although the number of features is large, many feature inputs are zero and do not provide any information to the detection task. The work in [30] has shown that classification performance saturates with dozens of features out of tens of thousands.
Taking this cue, in order to improve the efficiency and scalability of AdaOAM on high-dimensional sparse data, we propose the Sparse AdaOAM algorithm (SAdaOAM), which learns a sparse linear classifier that contains a limited number of active features. In particular, SAdaOAM addresses the issue of sparsity in the learned model while maintaining the efficacy of the original AdaOAM. To summarize, SAdaOAM has two benefits over the original AdaOAM algorithm: a simpler covariance matrix update and a sparse model update. Next, we introduce these properties separately.
First, we employ a simpler covariance matrix update rule for high-dimensional sparse data than the one used in AdaOAM. The motivation is that applying the original covariance update rule of AdaOAM to high-dimensional data would lead to extremely high computational and storage costs, since several matrix operations among multiple variables would be necessary in the update formulas. Therefore, we fall back to the standard definition of the covariance matrix and consider a simpler method for updates. Since the standard definition of the covariance matrix is $\mathrm{Cov}(\mathbf{x}) = \mathbb{E}[\mathbf{x}\mathbf{x}^\top] - \mathbb{E}[\mathbf{x}]\mathbb{E}[\mathbf{x}]^\top$, we just need to maintain the mean vector and the outer product of the instance at each iteration for the covariance update. In this case, we denote $Z_t^+ = \sum_{i \le t : y_i = +1} \mathbf{x}_i \mathbf{x}_i^\top$ and $Z_t^- = \sum_{i \le t : y_i = -1} \mathbf{x}_i \mathbf{x}_i^\top$. Then, the covariance matrices $S_t^+$ and $S_t^-$ can be formulated as
$$S_t^+ = \frac{Z_t^+}{T_t^+} - \mathbf{c}_t^+ [\mathbf{c}_t^+]^\top, \qquad S_t^- = \frac{Z_t^-}{T_t^-} - \mathbf{c}_t^- [\mathbf{c}_t^-]^\top.$$
At each iteration, one only needs to update $Z_t^{+}$ or $Z_t^{-}$ with $\mathbf{x}_t \mathbf{x}_t^\top$, together with the mean vector of the positive or negative instances, respectively, in the covariance matrices $S_t^+$ and $S_t^-$. With the above update scheme, lower computational and storage costs are attained, since most of the elements in the covariance matrices are zero on high-dimensional sparse data.
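The storage saving on sparse data can be sketched with dictionaries that hold only nonzero entries. The accumulator below keeps the running sum of each active feature and the running sum of outer-product entries for co-occurring features, and derives any covariance entry on demand from the standard definition. The class name `SparseMoments` and the dict-based instance format are illustrative assumptions.

```python
from collections import defaultdict


class SparseMoments:
    """Sparse accumulators for one class: feature sums and outer-product sums."""

    def __init__(self):
        self.n = 0
        self.sum_x = defaultdict(float)      # feature index -> sum of values
        self.sum_outer = defaultdict(float)  # (a, b) -> sum of x[a] * x[b]

    def update(self, x):
        """x is a dict {feature_index: value} holding only nonzero entries."""
        self.n += 1
        for a, va in x.items():
            self.sum_x[a] += va
            for b, vb in x.items():
                self.sum_outer[(a, b)] += va * vb

    def cov(self, a, b):
        """Population covariance entry E[x_a x_b] - E[x_a] E[x_b]."""
        mean_a = self.sum_x.get(a, 0.0) / self.n
        mean_b = self.sum_x.get(b, 0.0) / self.n
        return self.sum_outer.get((a, b), 0.0) / self.n - mean_a * mean_b


moments = SparseMoments()
moments.update({0: 1.0})
moments.update({0: 3.0, 5: 2.0})
```

The cost of each update is quadratic in the number of nonzero features of the incoming instance rather than in the full dimensionality.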
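After presenting the efficient scheme for updating the covariance matrices on high-dimensional sparse data, we next present the method for addressing sparsity in the learned model and second-order adaptive gradient updates simultaneously. Here we impose the soft-constraint $\ell_1$-norm regularization $\theta \|\mathbf{w}\|_1$ on the objective function (2). The new objective function is
$$\mathcal{F}_t(\mathbf{w}) = \mathcal{L}_t(\mathbf{w}) + \theta \|\mathbf{w}\|_1. \qquad (7)$$
In order to optimize this objective function, we apply the composite mirror descent method [31], which achieves a trade-off between the immediate adaptive gradient term $\mathbf{g}_t$ and the regularizer $\varphi(\mathbf{w}) = \theta \|\mathbf{w}\|_1$. We denote the $i$-th diagonal element of the matrix $H_t$ as $H_{t,ii} = \delta + \|\mathbf{g}_{1:t,i}\|_2$. Then, we give the derivation of the composite mirror descent gradient updates with $\ell_1$ regularization.
Following the framework of the composite mirror descent update in [10], the update needed to solve is
$$\mathbf{w}_{t+1} = \arg\min_{\|\mathbf{w}\|_2 \le 1/\sqrt{\lambda}} \left\{ \eta \langle \mathbf{g}_t, \mathbf{w} \rangle + \eta \varphi(\mathbf{w}) + B_{\psi_t}(\mathbf{w}, \mathbf{w}_t) \right\}, \qquad (8)$$
where $\mathbf{g}_t = \nabla \mathcal{L}_t(\mathbf{w}_t)$ and $B_{\psi_t}(\mathbf{w}, \mathbf{w}_t)$ is the Bregman divergence associated with $\psi_t$ (see details in the proof of Theorem 1). After the expansion, this update amounts to
$$\min_{\mathbf{w}} \; \eta \langle \mathbf{g}_t, \mathbf{w} \rangle + \eta \theta \|\mathbf{w}\|_1 + \frac{1}{2} \langle \mathbf{w} - \mathbf{w}_t, H_t (\mathbf{w} - \mathbf{w}_t) \rangle. \qquad (9)$$
For easier derivation, we rearrange and rewrite the above objective function coordinate-wise as
$$\min_{w_i} \; \eta g_{t,i} w_i + \eta \theta |w_i| + \frac{1}{2} H_{t,ii} (w_i - w_{t,i})^2.$$
Let $w_i^*$ denote the optimal solution of the above optimization problem. Standard subgradient calculus indicates that when $\left| w_{t,i} - \frac{\eta}{H_{t,ii}} g_{t,i} \right| \le \frac{\eta \theta}{H_{t,ii}}$, the solution is $w_i^* = 0$. Similarly, when $w_{t,i} - \frac{\eta}{H_{t,ii}} g_{t,i} > \frac{\eta \theta}{H_{t,ii}}$, then $w_i^* > 0$, the objective is differentiable, and the solution is achieved by setting the gradient to zero:
$$\eta g_{t,i} + \eta \theta + H_{t,ii} (w_i^* - w_{t,i}) = 0,$$
so that
$$w_i^* = w_{t,i} - \frac{\eta}{H_{t,ii}} g_{t,i} - \frac{\eta \theta}{H_{t,ii}}.$$
Similarly, when $w_{t,i} - \frac{\eta}{H_{t,ii}} g_{t,i} < -\frac{\eta \theta}{H_{t,ii}}$, then $w_i^* < 0$, and the solution is
$$w_i^* = w_{t,i} - \frac{\eta}{H_{t,ii}} g_{t,i} + \frac{\eta \theta}{H_{t,ii}}.$$
Combining these three cases, we obtain the coordinate-wise update results for $w_{t+1,i}$:
$$w_{t+1,i} = \mathrm{sign}\!\left( w_{t,i} - \frac{\eta}{H_{t,ii}} g_{t,i} \right) \left[ \left| w_{t,i} - \frac{\eta}{H_{t,ii}} g_{t,i} \right| - \frac{\eta \theta}{H_{t,ii}} \right]_+ .$$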
The complete sparse online AUC maximization approach using the adaptive gradient updating algorithm (SAdaOAM) is summarized in Algorithm 2.
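The three cases of the coordinate-wise solution collapse into a single soft-thresholding operation, sketched below for one coordinate. It assumes an l1 regularization weight `theta`, a base learning rate `eta`, and a positive diagonal scaling entry `h_ii`; these names and the function name are illustrative and not taken from Algorithm 2.

```python
import math


def soft_threshold_step(w_i, g_i, h_ii, eta, theta):
    """One adaptive proximal step for a single coordinate.

    First take the scaled gradient step, then shrink the result toward
    zero by eta * theta / h_ii, clipping at zero (soft-thresholding).
    """
    u = w_i - eta * g_i / h_ii          # adaptive gradient step
    shrink = eta * theta / h_ii         # per-coordinate threshold
    return math.copysign(max(abs(u) - shrink, 0.0), u)


large_positive = soft_threshold_step(0.5, 0.0, 1.0, eta=0.1, theta=1.0)
large_negative = soft_threshold_step(-0.5, 0.0, 1.0, eta=0.1, theta=1.0)
small_weight = soft_threshold_step(0.05, 0.0, 1.0, eta=0.1, theta=1.0)
```

Weights whose magnitude after the gradient step falls below the threshold are set exactly to zero, which is what produces a sparse model. A larger `h_ii`, corresponding to a frequently updated feature, shrinks both the step and the threshold.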
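From Algorithm 2, it can be observed that we perform "lazy" computation when the gradient vectors are sparse [10]. Suppose that, from iteration step $t_0$ to $t$, the $i$-th component of the gradient is 0. Then, we can evaluate the updates on demand, since $H_{t,ii}$ remains unchanged over these steps. Therefore, at iteration step $t$, when $w_{t,i}$ is needed, the update will be
$$w_{t,i} = \mathrm{sign}(w_{t_0,i}) \left[ |w_{t_0,i}| - \frac{\eta \theta}{H_{t_0,ii}} (t - t_0) \right]_+ ,$$
where $\theta$ is the $\ell_1$ regularization weight, $\eta$ is the learning rate, and $[a]_+$ means $\max(0, a)$. This type of "lazy" update is highly efficient, because coordinates with zero gradient incur no per-step cost.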
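The equivalence between the lazy update and step-by-step shrinkage can be checked with a short sketch. While a coordinate's gradient is zero, its scaling entry does not change, so several skipped shrinkage steps can be applied in one shot. The function names and the parameter names `eta`, `theta`, and `h_ii` are assumptions made for illustration.

```python
def shrink_once(w_i, h_ii, eta, theta):
    """One proximal step for a coordinate whose gradient is zero."""
    magnitude = max(abs(w_i) - eta * theta / h_ii, 0.0)
    return magnitude if w_i >= 0 else -magnitude


def lazy_catch_up(w_i, h_ii, eta, theta, skipped):
    """Apply `skipped` zero-gradient steps at once.

    Valid because h_ii is unchanged while the gradient coordinate is zero.
    """
    magnitude = max(abs(w_i) - skipped * eta * theta / h_ii, 0.0)
    return magnitude if w_i >= 0 else -magnitude


# Compare three explicit steps against a single lazy catch-up.
stepwise = 0.5
for _ in range(3):
    stepwise = shrink_once(stepwise, h_ii=2.0, eta=0.1, theta=1.0)
lazy = lazy_catch_up(0.5, h_ii=2.0, eta=0.1, theta=1.0, skipped=3)
```

With this bookkeeping, only the last update time of each coordinate needs to be stored, and the weight is brought up to date when the feature next appears in an instance.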
4 Theoretical Analysis
In this section, we provide regret bounds for the proposed AdaOAM algorithms on both regular and high-dimensional sparse data.
4.1 Regret Bounds with Regular Data
Firstly, we introduce two lemmas as follows, which will be used to facilitate our subsequent analyses.
Lemma 1.
Let , and be defined same in the Algorithm 1. Then
Lemma 2.
Let the sequence , be generated by the composite mirror descent update in Equation (12) and assume that . Using learning rate , for any optimal , the following bound holds
These two lemmas are Lemma 4 and Corollary 1 of [10], respectively.
Using these two lemmas, we can derive the following theorem for the proposed AdaOAM algorithm.
Theorem 1.
Assume and the diameter of is bounded via , we have
where , and .
Proof.
We first define as
Based on the regularizer , it is easy to obtain due to the strong convexity property, and it is also reasonable to restrict with . Denote the projection of a point onto according to norm by , the AdaOAM actually employs the following update rule:
(10) 
where and .
If we denote , and the dual norm of by , in which case , then it is easy to check the update rule 10 is the same with the following composite mirror descent method:
(11) 
where the regularization function , and is the Bregman divergence associated with a strongly convex and differentiable function
Since we have in the case of regular data, the regret bound . Then, we follow the derivation results of [10] and attain the following regret bound
where is bounded via . Next, we would like to analyze the features’ dependency on the data of the gradient. Since
where is a constant to bound the scalar of the second term in the right side of the equation, and with , we have
Finally, combining the above inequalities, we arrive at
∎
From the proof above, we can conclude that Algorithm 1 should have lower regret than non-adaptive algorithms, because its bound depends on the geometry of the underlying data space. If the features are normalized and sparse, the gradient terms in the bound should be much smaller than $\sqrt{T}$, which leads to lower regret and faster convergence. If the feature space is relatively dense, then the convergence rate will be $O(1/\sqrt{T})$ for the general case, as in the OPAUC and OAM methods.
4.2 Regret Bounds with High-dimensional Sparse Data
Theorem 2.
Assume and the diameter of is bounded via , the regret bound with respect to regularization term is
where , , and .
Proof.
In the case of highdimensional sparse adaptive online AUC maximization, the regret we plan to bound with respect to the optimal weight is formulated as
where is the regularization term to impose sparsity to the solution. Similarly, if denote , and the dual norm of by , in which case , it is easy to check the updating rule of SAdaOAM
is the same with the following one
(12) 
where . From [10], we have
Furthermore, we assume and set , the final regret bound is
This theoretical result shows that the regret bound for sparse solution is the same as that in the case when . ∎
As discussed above, the SAdaOAM algorithm should have a lower regret bound than non-adaptive algorithms on high-dimensional sparse data, though this depends on the geometric property of the underlying feature distribution. If some features appear much more frequently than others, then the bound indicates that we could obtain remarkably lower regret by using higher learning rates for infrequent features and lower learning rates for frequently occurring features.
5 Experimental Results
In this section, we evaluate the proposed AdaOAM algorithms in terms of AUC performance and convergence rate, and examine their parameter sensitivity. The main framework of the experiments is based on LIBOL, an open-source library for online learning algorithms (http://libol.stevenhoi.org/) [32].
5.1 Comparison Algorithms
We conduct comprehensive empirical studies by comparing the proposed algorithms with various AUC optimization algorithms for both online and batch scenarios. Specifically, the algorithms considered in our experiments include:

Online Uni-Exp: An online learning algorithm which optimizes the (weighted) univariate exponential loss [33];
Online Uni-Log: An online learning algorithm which optimizes the (weighted) univariate logistic loss [33];
OAM$_{seq}$: The OAM algorithm with reservoir sampling and sequential updating method [6];
OAM$_{gra}$: The OAM algorithm with reservoir sampling and online gradient updating method [6];
OPAUC: The one-pass AUC optimization algorithm with square loss function [7];
SVM-perf: A batch algorithm which directly optimizes AUC [4];
CAPO: A batch algorithm which trains nonlinear auxiliary classifiers first and then adapts auxiliary classifiers for specific performance measures [34];
Batch Uni-Log: A batch algorithm which optimizes the (weighted) univariate logistic loss [33];
Batch Uni-Squ: A batch algorithm which optimizes the (weighted) univariate square loss;
AdaOAM: The proposed adaptive gradient method for online AUC maximization;
SAdaOAM: The proposed sparse adaptive subgradient method for online AUC maximization.
Note that OAM$_{seq}$, OAM$_{gra}$, and OPAUC are the state-of-the-art methods for AUC maximization in online settings. For batch learning scenarios, CAPO and SVM-perf are both strong baselines to compare against.
5.2 Experimental Testbed and Setup
To examine the performance of the proposed AdaOAM in comparison with existing state-of-the-art methods, we conduct extensive experiments on sixteen benchmark datasets, consistent with previous studies on online AUC maximization [6, 7]. Table I shows the details of the 16 binary-class datasets used in our experiments. All of these datasets can be downloaded from the LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/) and UCI machine learning (http://www.ics.uci.edu/~mlearn/MLRepository.html) repositories. Note that several datasets (svmguide4, vehicle) are originally multi-class and were converted to class-imbalanced binary datasets for the purpose of our experimental studies.
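| datasets | #inst | #dim | $T_-/T_+$ | datasets | #inst | #dim | $T_-/T_+$ |
|---|---|---|---|---|---|---|---|
| glass | 214 | 9 | 2.057 | vehicle | 846 | 18 | 3.251 |
| heart | 270 | 13 | 1.250 | german | 1,000 | 24 | 2.333 |
| svmguide4 | 300 | 10 | 5.818 | svmguide3 | 1,243 | 22 | 3.199 |
| liver-disorders | 345 | 6 | 1.379 | a2a | 2,265 | 123 | 2.959 |
| balance | 625 | 4 | 11.755 | magic04 | 19,020 | 10 | 1.843 |
| breast | 683 | 10 | 1.857 | cod-rna | 59,535 | 8 | 2.000 |
| australian | 690 | 14 | 1.247 | acoustic | 78,823 | 50 | 3.316 |
| diabetes | 768 | 8 | 1.865 | poker | 1,025,010 | 11 | 10.000 |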
In the experiments, the features have been normalized fairly, i.e., $\mathbf{x}_t \leftarrow \mathbf{x}_t / \|\mathbf{x}_t\|$, which is reasonable since instances are received sequentially in the online learning setting. Each dataset has been randomly divided into 5 folds, of which 4 folds are for training and the remaining fold is for testing. We also generate 4 independent 5-fold partitions per dataset to further reduce the effects of random partitioning on the algorithms. Therefore, the reported AUC values are the average results of 20 runs for each dataset. 5-fold cross validation is conducted on the training sets to decide on the learning rate $\eta$ and the regularization parameter $\lambda$. For OAM$_{seq}$ and OAM$_{gra}$, the buffer size is fixed at 100 as suggested in [6]. All experiments for the online setting comparisons were conducted with MATLAB on a computer workstation with 16GB memory and a 3.20GHz CPU. On the other hand, for fair comparisons in the batch setting, the core steps of the algorithms were implemented in C++, since we directly use the toolboxes provided by the respective authors of the SVM-perf (http://www.cs.cornell.edu/people/tj/svm_light/svm_perf.html) and CAPO (http://lamda.nju.edu.cn/code_CAPO.ashx) algorithms.
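The evaluation protocol of 4 independent 5-fold partitions, giving 20 train/test runs per dataset, can be sketched as follows. This is an illustrative Python sketch rather than the MATLAB code used in the experiments; the function name `repeated_kfold` and the `seed` parameter are hypothetical.

```python
import random


def repeated_kfold(n, k=5, repeats=4, seed=0):
    """Return (train_indices, test_indices) pairs for repeated k-fold splits.

    Each repeat shuffles the indices independently and yields k splits,
    so the default setting produces 4 * 5 = 20 runs.
    """
    rng = random.Random(seed)
    splits = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]  # k nearly equal folds
        for f in range(k):
            test = folds[f]
            train = [i for j in range(k) if j != f for i in folds[j]]
            splits.append((train, test))
    return splits


splits = repeated_kfold(n=23)
```

The reported AUC for a dataset would then be the average of the test AUC over all returned splits.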
5.3 Evaluation of AdaOAM on Benchmark Datasets
Table II summarizes the average AUC performance of the algorithms under study over the 16 datasets in the online setting. In this table, we use •/∘ to indicate that AdaOAM is significantly better/worse than the corresponding method (pairwise t-tests at 95% significance level).
datasets  AdaOAM  OPAUC  OAM$_{seq}$  OAM$_{gra}$  online Uni-Log  online Uni-Exp 

glass  .816 ± .058  .804 ± .059 