Yi Ding, Peilin Zhao, Steven C.H. Hoi, Yew-Soon Ong Yi Ding is with the Department of Computer Science, The University of Chicago, Chicago, IL, USA, 60637, E-mail: dingy@uchicago.edu.
Corresponding author: Steven C.H. Hoi is with the School of Information Systems, Singapore Management University, Singapore 178902, E-mail: chhoi@smu.edu.sg.
Peilin Zhao is with the Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore 138632, E-mail: zhaop@i2r.a-star.edu.sg.
Yew-Soon Ong is with the School of Computer Engineering, Nanyang Technological University, Singapore 639798, E-mail: ASYSOng@ntu.edu.sg.

## 1 Introduction

AUC (Area Under the ROC Curve) [1] is an important measure for characterizing machine learning performance in many real-world applications, such as ranking and anomaly detection tasks, especially when misclassification costs are unknown. In general, AUC measures the probability that a randomly drawn positive instance receives a higher decision value than a randomly drawn negative instance. Many efforts have been devoted recently to developing efficient AUC optimization algorithms for both batch and online learning tasks [2, 3, 4, 5, 6, 7].

Due to its high efficiency and scalability in real-world applications, online AUC optimization for streaming data has been actively studied in the research community in recent years. The key challenge for AUC optimization in the online setting is that AUC is a metric represented by a sum of pairwise losses between instances from different classes, which makes conventional online learning algorithms unsuitable for direct use in many real-world scenarios. To address this challenge, two core types of Online AUC Maximization (OAM) frameworks have been proposed recently. The first framework is based on the idea of buffer sampling [6, 8], which stores some randomly sampled historical examples in a buffer to represent the observed data for calculating the pairwise loss functions. The other framework focuses on one-pass AUC optimization [7], where the algorithm scans through the training data only once. The benefit of one-pass AUC optimization lies in the use of the squared loss to represent the AUC loss function, together with proofs of its consistency with the AUC measure [9].

The rest of this paper is organized as follows. We first review related work from three core areas: online learning, AUC maximization, and sparse online learning. Then, we present the formulations of the proposed approaches for handling both regular and high-dimensional sparse data, together with their theoretical analysis; we further show and discuss the comprehensive experimental results, the sensitivity of the parameters, and the tradeoff between the level of sparsity and AUC performance. Finally, we conclude the paper with a brief summary of the present work.

## 2 Related Work

Our work is closely related to three topics in the context of machine learning, namely, online learning, AUC maximization, and sparse online learning. Below we briefly review some of the important related work in these areas.

Online Learning. Online learning has been extensively studied in the machine learning community [11, 12, 13, 14, 15], mainly due to its high efficiency and scalability on large-scale learning tasks. Different from conventional batch learning methods, which assume all training instances are available prior to the learning phase, online learning considers one instance at a time to update the model sequentially and iteratively. Therefore, online learning is ideally suited for tasks in which data arrives sequentially. A number of first-order algorithms have been proposed, including the well-known Perceptron algorithm [16] and the Passive-Aggressive (PA) algorithm [12]. Although PA introduces the concept of "maximum margin" for classification, it fails to control the direction and scale of parameter updates during the online learning phase. To address this issue, recent years have witnessed several second-order online learning algorithms [17, 18, 19, 20], which exploit parameter confidence information to improve online learning performance. Furthermore, to solve cost-sensitive classification tasks on the fly, researchers have also proposed a few novel online learning algorithms that directly optimize more meaningful cost-sensitive metrics [21, 22, 23].

AUC Maximization. AUC (Area Under the ROC Curve) is an important performance measure that has been widely used for classification over imbalanced data distributions. The ROC curve plots the true positive rate against the false positive rate at various threshold settings. Thus, AUC represents the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. Recently, many algorithms have been developed to optimize AUC directly [2, 3, 4, 6, 7]. In [4], the authors first presented a general framework for optimizing multivariate nonlinear performance measures, such as AUC, F1, etc., in a batch mode. Online learning algorithms for AUC maximization in large-scale applications have also been studied. Among these, two core online AUC optimization frameworks have been proposed very recently. The first framework is based on the idea of buffer sampling [6, 8], which employs a fixed-size buffer to represent the observed data for calculating the pairwise loss functions. A representative study is available in [6], which leveraged the reservoir sampling technique to represent the observed data instances by a fixed-size buffer, and reported notable theoretical and empirical results. Then, [8] studied the improved generalization capability of online learning algorithms for pairwise loss functions within the framework of buffer sampling. The main contribution of their work is the introduction of stream subsampling with replacement as the buffer update strategy. The other framework, which takes a different perspective, was presented by [7]. They extended the previous online AUC maximization framework with a regression-based one-pass learning mode, and achieved solid regret bounds by considering the square loss for the AUC optimization task due to its theoretical consistency with AUC.

Sparse Online Learning. High dimensionality and high sparsity are two important issues in large-scale machine learning tasks. Many previous efforts have been devoted to tackling these issues in the batch setting, but such methods usually suffer from poor scalability when dealing with big data. Recent years have witnessed extensive research on sparse online learning [24, 25, 26, 27], which aims to learn sparse classifiers by limiting the number of active features. There are two core categories of methods for sparse online learning. The first type follows the general framework of subgradient descent with truncation. A representative example is the FOBOS algorithm [25], which builds on the Forward-Backward Splitting method and solves the sparse online learning problem by alternating between two phases: (i) an unconstrained stochastic subgradient descent step with respect to the loss function, and (ii) an instantaneous optimization that trades off between keeping close proximity to the result of the first step and minimizing the regularization term. Following this strategy, [24] argues that the truncation at each step is too aggressive and thus proposes the Truncated Gradient (TG) method, which alleviates the updates by truncating the coefficients only every K steps when they fall below a predefined threshold. The second category of methods is mainly motivated by the dual averaging method [28]. The most popular method in this category is Regularized Dual Averaging (RDA) [26], which solves the optimization problem by using the running average of all past subgradients of the loss functions and the whole regularization term, instead of the current subgradient alone. In this manner, the RDA method has been shown to exploit the regularization structure more easily in the online phase and to obtain the desired regularization effects more efficiently.

Despite the extensive work in these different fields of machine learning, to the best of our knowledge, our current work represents the first effort to explore adaptive gradient optimization and second-order learning techniques for online AUC maximization in both regular and sparse online learning settings.

### 3.1 Problem Setting

We aim to learn a linear classification model that maximizes AUC for a binary classification problem. Without loss of generality, we assume that the positive instances are fewer than the negative ones. Denote by $(\mathbf{x}_t, y_t)$ the training instance received at the $t$-th trial, where $\mathbf{x}_t \in \mathbb{R}^d$ and $y_t \in \{-1, +1\}$, and by $\mathbf{w}_t \in \mathbb{R}^d$ the weight vector learned so far.

Given this setting, let us define the AUC measurement [1] for the binary classification task. Given a dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$, we divide it into two sets naturally: the set of positive instances $\mathcal{D}_+ = \{(\mathbf{x}_i^+, +1)\}_{i=1}^{n_+}$ and the set of negative instances $\mathcal{D}_- = \{(\mathbf{x}_j^-, -1)\}_{j=1}^{n_-}$, where $n_+$ and $n_-$ are the numbers of positive and negative instances, respectively. For a linear classifier $\mathbf{w}$, its AUC measurement on $\mathcal{D}$ is defined as follows:

 $\mathrm{AUC}(\mathbf{w}) = \frac{\sum_{i=1}^{n_+}\sum_{j=1}^{n_-} \mathbb{I}(\mathbf{w}\cdot\mathbf{x}_i^+ > \mathbf{w}\cdot\mathbf{x}_j^-) + \frac{1}{2}\,\mathbb{I}(\mathbf{w}\cdot\mathbf{x}_i^+ = \mathbf{w}\cdot\mathbf{x}_j^-)}{n_+ n_-},$

where $\mathbb{I}(\cdot)$ is the indicator function that outputs 1 if the prediction holds and 0 otherwise. We replace the indicator function with the following convex surrogate, i.e., the square loss from [7], due to its consistency with AUC [9],

 $\ell(\mathbf{w}, \mathbf{x}_i^+ - \mathbf{x}_j^-) = \big(1 - \mathbf{w}\cdot(\mathbf{x}_i^+ - \mathbf{x}_j^-)\big)^2,$

and find the optimal classifier by minimizing the following objective function

 $\mathcal{L}(\mathbf{w}) = \frac{\lambda}{2}\|\mathbf{w}\|_2^2 + \frac{\sum_{i=1}^{n_+}\sum_{j=1}^{n_-}\ell(\mathbf{w},\, \mathbf{x}_i^+ - \mathbf{x}_j^-)}{2\, n_+ n_-}, \qquad (1)$

where $\lambda$ is introduced to regularize the complexity of the linear classifier. Note that the optimal $\mathbf{w}^*$ satisfies $\|\mathbf{w}^*\| \le 1/\sqrt{\lambda}$ according to the strong duality theorem.
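To make the definition concrete, the AUC measurement above can be evaluated directly from its pairwise form. The following is a minimal sketch (function and variable names are ours, not from the paper); it scores both classes with the linear model and counts correctly ranked positive/negative pairs, counting ties as 1/2:

```python
def pairwise_auc(w, X_pos, X_neg):
    """AUC of the linear scorer w, computed directly from the pairwise
    definition: the fraction of (positive, negative) pairs ranked
    correctly, with ties contributing 1/2."""
    score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
    s_pos = [score(x) for x in X_pos]
    s_neg = [score(x) for x in X_neg]
    wins = 0.0
    for sp in s_pos:
        for sn in s_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(s_pos) * len(s_neg))
```

This naive computation costs $O(n_+ n_-)$ pairwise comparisons, which is exactly the scalability problem the one-pass reformulation in the next section avoids.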

### 3.2 Adaptive Online AUC Maximization

Here, we shall introduce the proposed Adaptive Online AUC Maximization (AdaOAM) algorithm. Following a similar approach to [7], we modify the loss function in (1) as a sum of losses for individual training instances, $\sum_{t=1}^{T}\mathcal{L}_t(\mathbf{w})$, where

 $\mathcal{L}_t(\mathbf{w}) = \frac{\lambda}{2}\|\mathbf{w}\|_2^2 + \frac{\mathbb{I}_{[y_t=1]}\sum_{i<t:\, y_i=-1}\ell(\mathbf{w}, \mathbf{x}_t - \mathbf{x}_i) + \mathbb{I}_{[y_t=-1]}\sum_{i<t:\, y_i=1}\ell(\mathbf{w}, \mathbf{x}_i - \mathbf{x}_t)}{2\, T_t^{-y_t}} \qquad (2)$

for the i.i.d. sequence $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_t, y_t)$; it is an unbiased estimate of $\mathcal{L}(\mathbf{w})$. $\mathcal{S}_t^+$ and $\mathcal{S}_t^-$ denote the sets of positive and negative instances received up to trial $t$, respectively, and $T_t^+$ and $T_t^-$ are their respective cardinalities, with $T_t^{-y_t}$ standing for $T_t^-$ if $y_t = 1$ and $T_t^+$ otherwise. Besides, $\mathcal{L}_t(\mathbf{w})$ is set to 0 when $T_t^{-y_t} = 0$. If $y_t = 1$, the gradient of (2) is

 $\nabla\mathcal{L}_t(\mathbf{w}) = \lambda\mathbf{w} + \mathbf{x}_t\mathbf{x}_t^\top\mathbf{w} - \mathbf{x}_t + \frac{\sum_{i:\, y_i=-1}\big[\mathbf{x}_i + (\mathbf{x}_i\mathbf{x}_i^\top - \mathbf{x}_i\mathbf{x}_t^\top - \mathbf{x}_t\mathbf{x}_i^\top)\mathbf{w}\big]}{T_t^-}.$

Using $\mathbf{c}_t^-$ and $S_t^-$ to refer to the mean and covariance matrix of the negative class, respectively, the gradient of (2) can be simplified as

 $\nabla\mathcal{L}_t(\mathbf{w}) = \lambda\mathbf{w} - \mathbf{x}_t + \mathbf{c}_t^- + \big[(\mathbf{x}_t - \mathbf{c}_t^-)(\mathbf{x}_t - \mathbf{c}_t^-)^\top + S_t^-\big]\mathbf{w}. \qquad (3)$

Similarly, if $y_t = -1$,

 $\nabla\mathcal{L}_t(\mathbf{w}) = \lambda\mathbf{w} + \mathbf{x}_t - \mathbf{c}_t^+ + \big[(\mathbf{x}_t - \mathbf{c}_t^+)(\mathbf{x}_t - \mathbf{c}_t^+)^\top + S_t^+\big]\mathbf{w}, \qquad (4)$

where $\mathbf{c}_t^+$ and $S_t^+$ are the mean and covariance matrix of the positive class, respectively.

Upon obtaining the gradient $\mathbf{g}_t = \nabla\mathcal{L}_t(\mathbf{w}_t)$, a simple solution is to move the weight in the opposite direction of $\mathbf{g}_t$, while keeping $\|\mathbf{w}_{t+1}\| \le 1/\sqrt{\lambda}$ via the projected gradient update [29]

 $\mathbf{w}_{t+1} = \Pi_{1/\sqrt{\lambda}}(\mathbf{w}_t - \eta\mathbf{g}_t) = \arg\min_{\|\mathbf{w}\| \le 1/\sqrt{\lambda}} \|\mathbf{w} - (\mathbf{w}_t - \eta\mathbf{g}_t)\|_2^2,$

due to $\|\mathbf{w}^*\| \le 1/\sqrt{\lambda}$.
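As an illustration, the projected gradient update above amounts to a plain gradient step followed by a Euclidean projection onto the ball of radius $1/\sqrt{\lambda}$, which for the $\ell_2$ ball has the simple closed form of rescaling. A minimal sketch (names are ours):

```python
import numpy as np

def project_l2_ball(v, radius):
    """Euclidean projection onto {w : ||w||_2 <= radius}, i.e. the argmin
    in the projected-gradient update: rescale v when it leaves the ball."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

def pgd_step(w, g, eta, lam):
    """One plain projected-gradient step keeping ||w|| <= 1/sqrt(lam)."""
    return project_l2_ball(w - eta * g, 1.0 / np.sqrt(lam))
```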

However, the above scheme is clearly insufficient, since it simply assigns the same learning rate to all features. In order to perform feature-wise gradient updating, we propose a second-order gradient optimization method, i.e., an Adaptive Gradient Updating strategy, inspired by [10]. Specifically, we denote by $\mathbf{g}_{1:t} = [\mathbf{g}_1 \cdots \mathbf{g}_t]$ the matrix obtained by concatenating the gradient sequence. The $i$-th row of this matrix is $\mathbf{g}_{1:t,i}$, which is the concatenation of the $i$-th component of each gradient. In addition, we define the outer product matrix $G_t = \sum_{\tau=1}^{t}\mathbf{g}_\tau\mathbf{g}_\tau^\top$. Using these notations, the generalization of the standard adaptive gradient descent leads to the following weight update

 $\mathbf{w}_{t+1} = \Pi_{1/\sqrt{\lambda}}^{G_t^{1/2}}\big(\mathbf{w}_t - \eta\, G_t^{-1/2}\mathbf{g}_t\big),$

where $\Pi_{1/\sqrt{\lambda}}^{G_t^{1/2}}(\mathbf{u}) = \arg\min_{\|\mathbf{w}\| \le 1/\sqrt{\lambda}}\|\mathbf{w} - \mathbf{u}\|_{G_t^{1/2}}$, which uses the Mahalanobis norm to denote the projection of a point onto the feasible set $\{\mathbf{w}: \|\mathbf{w}\| \le 1/\sqrt{\lambda}\}$.

However, an obvious drawback of the above update lies in the significant computational effort needed to handle high-dimensional data, since it requires calculating the root and inverse root of the outer product matrix $G_t$. In order to make the algorithm more efficient, we use the diagonal proxy of $G_t$, and thus the update becomes

 $\mathbf{w}_{t+1} = \Pi_{1/\sqrt{\lambda}}^{\operatorname{diag}(G_t)^{1/2}}\big(\mathbf{w}_t - \eta\operatorname{diag}(G_t)^{-1/2}\mathbf{g}_t\big). \qquad (5)$

In this way, both the root and inverse root of $\operatorname{diag}(G_t)$ can be computed in linear time. Furthermore, as we discuss later, when the gradient vectors are sparse, the update above can be conducted more efficiently in time proportional to the support of the gradient.

Another concern with the updating rule (5) is that $\operatorname{diag}(G_t)$ may not be invertible in all coordinates. To address this issue, we replace it with $H_t = \delta I + \operatorname{diag}(G_t)^{1/2}$, where $\delta > 0$ is a smoothing parameter. The parameter $\delta$ is introduced to make the diagonal matrix invertible and the algorithm robust; it is usually set to a very small value so that it has little influence on the learning results. Given $H_t$, the feature-wise adaptive update can be computed as

 $\mathbf{w}_{t+1} = \Pi_{1/\sqrt{\lambda}}^{H_t}\big(\mathbf{w}_t - \eta H_t^{-1}\mathbf{g}_t\big). \qquad (6)$

The intuition behind update rule (6) is natural: rarely occurring features are considered more informative and discriminative than frequently occurring ones. Thus, these informative rare features should be updated with higher learning rates, incorporating the geometrical property of the data observed in earlier stages. Besides, by using the previously observed gradients, the update process can mitigate the effects of noise and speed up convergence.
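The diagonal adaptive update can be sketched in a few lines. Note that this is a simplified illustration rather than the exact Algorithm 1: we accumulate squared gradients for $\operatorname{diag}(G_t)$, form $H_t = \delta I + \operatorname{diag}(G_t)^{1/2}$, and, for brevity, replace the Mahalanobis projection with a plain $\ell_2$-ball projection (a simplification of ours):

```python
import numpy as np

def ada_step(w, g, state, eta=0.1, lam=1e-3, delta=1e-8):
    """One feature-wise adaptive step in the spirit of update (6).
    `state` carries the accumulated squared gradients, i.e. diag(G_t).
    Simplification (ours): the exact Mahalanobis projection is replaced
    by a plain L2-ball projection onto ||w|| <= 1/sqrt(lam)."""
    state = state + g * g              # diag(G_t) = diag(G_{t-1}) + g_t^2
    H = delta + np.sqrt(state)         # H_t = delta*I + diag(G_t)^{1/2}
    w = w - eta * g / H                # per-coordinate learning rates
    radius = 1.0 / np.sqrt(lam)
    norm = np.linalg.norm(w)
    if norm > radius:                  # keep the iterate feasible
        w = w * (radius / norm)
    return w, state
```

Note how a coordinate whose gradient has been rare receives a larger effective step than one with a long gradient history, which is exactly the feature-wise behavior motivated above.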

So far, we have presented the key framework of the basic update rule for model learning, except for the details of the gradient calculation. From the gradient equations (3) and (4), we need to maintain and update the mean vectors and covariance matrices of the observed instance sequences. The mean vectors are easy to compute and store, while the covariance matrices are more difficult to update in the online setting. Therefore, we provide a simplified update scheme for the covariance matrix computation. For an incoming instance sequence $\mathbf{x}_1, \ldots, \mathbf{x}_t$ with mean $\mathbf{c}_t = \frac{1}{t}\sum_{\tau=1}^{t}\mathbf{x}_\tau$, the covariance matrix is given by $\Gamma_t = \frac{1}{t}\sum_{\tau=1}^{t}\mathbf{x}_\tau\mathbf{x}_\tau^\top - \mathbf{c}_t\mathbf{c}_t^\top$.

Then, in our gradient update, setting $S_t^+ = \Gamma_t^+$ and $S_t^- = \Gamma_t^-$, the covariance matrices are updated as follows (shown for the negative class; the positive class is analogous):

 $\Gamma_t^- = \Gamma_{t-1}^- + \mathbf{c}_{t-1}^-[\mathbf{c}_{t-1}^-]^\top - \mathbf{c}_t^-[\mathbf{c}_t^-]^\top + \frac{\mathbf{x}_t\mathbf{x}_t^\top - \Gamma_{t-1}^- - \mathbf{c}_{t-1}^-[\mathbf{c}_{t-1}^-]^\top}{T_t^-}.$

It can be observed that the above updates fit the online setting well. Finally, Algorithm 1 summarizes the proposed AdaOAM method.
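The covariance recursion above can be checked with a short sketch that maintains one class's mean and covariance incrementally (names are ours; `count` plays the role of $T_t^-$):

```python
import numpy as np

def update_mean_cov(mean, cov, count, x):
    """Online update of a class mean c_t and covariance Gamma_t when a new
    instance x of that class arrives, mirroring the recursion above:
    Gamma_t = Gamma_{t-1} + c_{t-1}c_{t-1}^T - c_t c_t^T
              + (x x^T - Gamma_{t-1} - c_{t-1}c_{t-1}^T) / T_t."""
    count += 1
    new_mean = mean + (x - mean) / count
    new_cov = (cov + np.outer(mean, mean) - np.outer(new_mean, new_mean)
               + (np.outer(x, x) - cov - np.outer(mean, mean)) / count)
    return new_mean, new_cov, count
```

Feeding the instances one by one reproduces the batch quantities $\mathbf{c}_t$ and $\frac{1}{t}\sum\mathbf{x}\mathbf{x}^\top - \mathbf{c}_t\mathbf{c}_t^\top$, which is why the update fits the online setting.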

### 3.3 Fast AdaOAM Algorithm for High-dimensional Sparse Data

A characteristic of the proposed AdaOAM algorithm described above is that it exploits the full features for weight learning, which may not be suitable or scalable for high-dimensional sparse data tasks. For example, in spam email detection tasks, the length of the vocabulary list can reach the million scale. Although the number of the features is large, many feature inputs are zero and do not provide any information to the detection task. The research work in [30] has shown that the classification performance saturates with dozens of features out of tens of thousands of features.

Motivated by this, in order to improve the efficiency and scalability of the AdaOAM algorithm on high-dimensional sparse data, we propose the Sparse AdaOAM algorithm (SAdaOAM), which learns a sparse linear classifier containing a limited number of active features. In particular, SAdaOAM addresses the issue of sparsity in the learned model while maintaining the efficacy of the original AdaOAM. To summarize, SAdaOAM has two benefits over the original AdaOAM algorithm: a simpler covariance matrix update and a sparse model update. Next, we introduce these properties separately.

First, we employ a simpler covariance matrix update rule when handling high-dimensional sparse data. The motivation for a different update scheme here is that using the original covariance update rule of AdaOAM on high-dimensional data would incur extremely high computational and storage costs, i.e., several matrix operations among multiple variables would be necessary in the update formulations. Therefore, we fall back on the standard definition of the covariance matrix and consider a simpler update method. Since the standard definition of the covariance matrix is $S_t = \frac{1}{T_t}\sum_i \mathbf{x}_i\mathbf{x}_i^\top - \mathbf{c}_t\mathbf{c}_t^\top$, we just need to maintain the mean vector and the sum of outer products of the instances at each iteration for the covariance update. In this case, we denote $Z_t^+ = \sum_{i:\, y_i=1}\mathbf{x}_i\mathbf{x}_i^\top$ and $Z_t^- = \sum_{i:\, y_i=-1}\mathbf{x}_i\mathbf{x}_i^\top$. Then, the covariance matrices $S_t^+$ and $S_t^-$ can be formulated as

 $S_t^+ = Z_t^+/T_t^+ - \mathbf{c}_t^+[\mathbf{c}_t^+]^\top \quad\text{and}\quad S_t^- = Z_t^-/T_t^- - \mathbf{c}_t^-[\mathbf{c}_t^-]^\top.$

At each iteration, one only needs to update $Z_t^+$ or $Z_t^-$ with $\mathbf{x}_t\mathbf{x}_t^\top$, together with the mean vectors of the positive and negative instances, for the covariance matrices $S_t^+$ and $S_t^-$. With the above scheme, lower computational and storage costs are attained, since most of the elements in the covariance matrices are zero for high-dimensional sparse data.
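A minimal sketch of this bookkeeping (names are ours): accumulate $Z_t$ and the running sum of instances online, and form $S_t$ only when the gradient computation actually needs it:

```python
import numpy as np

def accumulate_stats(Z, sum_x, count, x):
    """Accumulate the sufficient statistics of the simpler covariance rule:
    Z_t (sum of outer products) and the running sum of instances. For a
    sparse x, the outer product x x^T only touches the nonzero block."""
    return Z + np.outer(x, x), sum_x + x, count + 1

def cov_from_stats(Z, sum_x, count):
    """Form S_t = Z_t / T_t - c_t c_t^T on demand."""
    c = sum_x / count
    return Z / count - np.outer(c, c)
```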

After presenting the efficient scheme for updating the covariance matrices on high-dimensional sparse data, we proceed to present the method for simultaneously addressing sparsity in the learned model and second-order adaptive gradient updates. Here, we consider imposing a soft $\ell_1$-norm regularization on the objective function (2). The new objective function is

 $\min_{\|\mathbf{w}\| \le 1/\sqrt{\lambda}} \ \mathcal{L}_t(\mathbf{w}) + \theta\|\mathbf{w}\|_1. \qquad (7)$

In order to optimize this objective function, we apply the composite mirror descent method [31], which is able to achieve a trade-off between the immediate adaptive gradient term and the regularizer $\varphi(\mathbf{w}) = \theta\|\mathbf{w}\|_1$. We denote the $i$-th diagonal element of the matrix $H_t$ by $H_{t,ii}$. Then, we give the derivation of the composite mirror descent gradient updates with $\ell_1$ regularization.

Following the framework of the composite mirror descent update in [10], the update needed to solve (7) is

 $\mathbf{w}_{t+1} = \arg\min_{\|\mathbf{w}\| \le 1/\sqrt{\lambda}}\big\{\eta\langle\mathbf{g}_t, \mathbf{w}\rangle + \eta\varphi(\mathbf{w}) + B_{\psi_t}(\mathbf{w}, \mathbf{w}_t)\big\}, \qquad (8)$

where $\psi_t(\mathbf{w}) = \frac{1}{2}\langle\mathbf{w}, H_t\mathbf{w}\rangle$ and $B_{\psi_t}(\cdot,\cdot)$ is the Bregman divergence associated with $\psi_t$ (see details in the proof of Theorem 1). After expansion, this update amounts to

 $\mathbf{w}_{t+1} = \arg\min_{\mathbf{w}} \ \eta\langle\mathbf{g}_t, \mathbf{w}\rangle + \eta\theta\|\mathbf{w}\|_1 + \frac{1}{2}\langle\mathbf{w} - \mathbf{w}_t, H_t(\mathbf{w} - \mathbf{w}_t)\rangle. \qquad (9)$

For an easier derivation, we rearrange and rewrite the above objective function as

 $\min_{\mathbf{w}} \ \langle\eta\mathbf{g}_t - H_t\mathbf{w}_t, \mathbf{w}\rangle + \frac{1}{2}\langle\mathbf{w}, H_t\mathbf{w}\rangle + \frac{1}{2}\langle\mathbf{w}_t, H_t\mathbf{w}_t\rangle + \eta\theta\|\mathbf{w}\|_1.$

Let $\hat{\mathbf{w}}$ denote the optimal solution of the above optimization problem. Standard subgradient calculus indicates that when $|H_{t,ii}w_{t,i} - \eta g_{t,i}| \le \eta\theta$, the solution is $\hat{w}_i = 0$. When $H_{t,ii}w_{t,i} - \eta g_{t,i} < -\eta\theta$, then $\hat{w}_i < 0$, the objective is differentiable, and the solution is achieved by setting the gradient to zero:

 $\eta g_{t,i} - H_{t,ii}w_{t,i} + H_{t,ii}\hat{w}_i - \eta\theta = 0,$

so that

 $\hat{w}_i = w_{t,i} - \frac{\eta}{H_{t,ii}}g_{t,i} + \frac{\eta\theta}{H_{t,ii}}.$

Similarly, when $H_{t,ii}w_{t,i} - \eta g_{t,i} > \eta\theta$, then $\hat{w}_i > 0$, and the solution is

 $\hat{w}_i = w_{t,i} - \frac{\eta}{H_{t,ii}}g_{t,i} - \frac{\eta\theta}{H_{t,ii}}.$

Combining these three cases, we obtain the coordinate-wise update result for $\mathbf{w}_{t+1}$:

 $w_{t+1,i} = \operatorname{sign}\!\Big(w_{t,i} - \frac{\eta}{H_{t,ii}}g_{t,i}\Big)\Big[\Big|w_{t,i} - \frac{\eta}{H_{t,ii}}g_{t,i}\Big| - \frac{\eta\theta}{H_{t,ii}}\Big]_+.$

The complete sparse online AUC maximization approach using the adaptive gradient updating algorithm (SAdaOAM) is summarized in Algorithm 2.
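The coordinate-wise update derived above is a soft-thresholding step and can be sketched as follows (a vectorized illustration with our own names, not the full Algorithm 2):

```python
import numpy as np

def sada_coordinate_update(w, g, H_diag, eta, theta):
    """Coordinate-wise composite mirror-descent update of the SAdaOAM
    flavor: an adaptive gradient step followed by per-coordinate
    soft-thresholding with threshold eta*theta/H_{t,ii}."""
    u = w - eta * g / H_diag          # w_{t,i} - (eta / H_{t,ii}) g_{t,i}
    thresh = eta * theta / H_diag     # per-coordinate shrinkage amount
    return np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)
```

Coordinates whose shifted value falls below the threshold are set exactly to zero, which is what produces the sparse model.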

From Algorithm 2, it can be observed that we perform "lazy" computation when the gradient vectors are sparse [10]. Suppose that, from iteration step $t_0$ to $t$, the $i$-th component of the gradient is 0. Then, we can evaluate the updates on demand, since $\mathbf{g}_{1:t,i}$ remains intact. Therefore, at iteration step $t$, when $w_{t,i}$ is needed, the update will be

 $w_{t,i} = \operatorname{sign}(w_{t_0,i})\Big[|w_{t_0,i}| - \frac{\eta\theta}{H_{t_0,ii}}(t - t_0)\Big]_+,$

where $[a]_+$ denotes $\max(0, a)$. Obviously, this type of "lazy" update enjoys high efficiency.
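The lazy catch-up formula above can be sketched directly (names are ours; as in the closed form, $H_{t_0,ii}$ is unchanged while the coordinate's gradient stays zero):

```python
import math

def lazy_catch_up(w_i, H_ii, eta, theta, steps):
    """Apply `steps` iterations of pure shrinkage in one shot for a
    coordinate whose gradient stayed zero, per the closed form
    sign(w_{t0,i}) * [|w_{t0,i}| - (eta*theta/H_{t0,ii}) * (t - t0)]_+ ."""
    shrink = (eta * theta / H_ii) * steps
    return math.copysign(max(abs(w_i) - shrink, 0.0), w_i)
```

Only the coordinates actually touched by a nonzero gradient component need to be materialized, so each round costs time proportional to the support of the gradient.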

## 4 Theoretical Analysis

In this section, we provide the regret bounds for the proposed AdaOAM algorithms when handling regular and high-dimensional sparse data, respectively.

### 4.1 Regret Bounds with Regular Data

Firstly, we introduce two lemmas as follows, which will be used to facilitate our subsequent analyses.

###### Lemma 1.

Let $\mathbf{g}_t$, $\mathbf{g}_{1:t,i}$, and $\mathbf{s}_t$ be defined the same as in Algorithm 1. Then

 $\sum_{t=1}^{T}\langle\mathbf{g}_t, \operatorname{diag}(\mathbf{s}_t)^{-1}\mathbf{g}_t\rangle \le 2\sum_{i=1}^{d}\|\mathbf{g}_{1:T,i}\|_2.$
###### Lemma 2.

Let the sequence $\{\mathbf{w}_t\}$ be generated by the composite mirror descent update in Equation (12) and assume $\max_{t \le T}\|\mathbf{w}^* - \mathbf{w}_t\|_\infty \le D_\infty$. Using the learning rate $\eta = D_\infty/\sqrt{2}$, for any optimal $\mathbf{w}^*$, the following bound holds:

 $\sum_{t=1}^{T}\big[\mathcal{L}_t(\mathbf{w}_t) - \mathcal{L}_t(\mathbf{w}^*)\big] \le \sqrt{2}\, D_\infty \sum_{i=1}^{d}\|\mathbf{g}_{1:T,i}\|_2.$

These two lemmas are Lemma 4 and Corollary 1 in [10], respectively.

Using these two lemmas, we can derive the following theorem for the proposed AdaOAM algorithm.

###### Theorem 1.

Assume $\|\mathbf{x}_t\| \le 1$ and that the diameter of the feasible set $\{\mathbf{w}: \|\mathbf{w}\| \le 1/\sqrt{\lambda}\}$ is bounded via $\max_{t \le T}\|\mathbf{w}^* - \mathbf{w}_t\|_\infty \le D_\infty$. Then we have

 $\sum_{t=1}^{T}\big[\mathcal{L}_t(\mathbf{w}_t) - \mathcal{L}_t(\mathbf{w}^*)\big] \le 2 D_\infty \sum_{i=1}^{d}\sqrt{\sum_{t=1}^{T}\big[(\lambda w_{t,i})^2 + C(r_{t,i})^2\big]},$

where $C$ is a constant bounding the pairwise loss term in the gradient, $r_{t,i} = \max_{j<t}|x_{j,i} - x_{t,i}|$, and $\eta = D_\infty/\sqrt{2}$.

###### Proof.

We first define $\psi_t$ as $\psi_t(\mathbf{w}) = \frac{1}{2}\langle\mathbf{w}, H_t\mathbf{w}\rangle$. Based on the regularizer $\psi_t$, it is easy to verify its strong convexity, and it is also reasonable to restrict $\mathbf{w}$ with $\|\mathbf{w}\| \le 1/\sqrt{\lambda}$, due to $\|\mathbf{w}^*\| \le 1/\sqrt{\lambda}$. Denoting by $\Pi_{1/\sqrt{\lambda}}^{H_t}(\cdot)$ the projection of a point onto $\{\mathbf{w}: \|\mathbf{w}\| \le 1/\sqrt{\lambda}\}$ according to the norm $\|\cdot\|_{H_t}$, AdaOAM actually employs the following update rule:

 $\mathbf{w}_{t+1} = \Pi_{1/\sqrt{\lambda}}^{H_t}\big(\mathbf{w}_t - \eta H_t^{-1}\mathbf{g}_t\big), \qquad (10)$

where $H_t = \delta I + \operatorname{diag}(G_t)^{1/2}$ and $\mathbf{g}_t = \nabla\mathcal{L}_t(\mathbf{w}_t)$.

If we denote the dual norm of $\|\cdot\|_{\psi_t}$ by $\|\cdot\|_{\psi_t^*}$, in which case $\|\mathbf{g}\|_{\psi_t^*} = \|\mathbf{g}\|_{H_t^{-1}}$, then it is easy to check that update rule (10) is the same as the following composite mirror descent method:

 $\mathbf{w}_{t+1} = \arg\min_{\|\mathbf{w}\| \le 1/\sqrt{\lambda}}\big\{\eta\langle\mathbf{g}_t, \mathbf{w}\rangle + \eta\varphi(\mathbf{w}) + B_{\psi_t}(\mathbf{w}, \mathbf{w}_t)\big\}, \qquad (11)$

where the regularization function is $\varphi(\mathbf{w}) = 0$, and $B_{\psi_t}(\cdot,\cdot)$ is the Bregman divergence associated with the strongly convex and differentiable function $\psi_t$:

 $B_{\psi_t}(\mathbf{w}, \mathbf{w}_t) = \psi_t(\mathbf{w}) - \psi_t(\mathbf{w}_t) - \langle\nabla\psi_t(\mathbf{w}_t), \mathbf{w} - \mathbf{w}_t\rangle.$

Since $\varphi(\mathbf{w}) = 0$ in the case of regular data, the regret is $R(T) = \sum_{t=1}^{T}[\mathcal{L}_t(\mathbf{w}_t) - \mathcal{L}_t(\mathbf{w}^*)]$. Then, following the derivation of [10], we attain the regret bound

 $\sum_{t=1}^{T}\big[\mathcal{L}_t(\mathbf{w}_t) - \mathcal{L}_t(\mathbf{w}^*)\big] \le \sqrt{2}\, D_\infty \sum_{i=1}^{d}\|\mathbf{g}_{1:T,i}\|_2,$

where $D_\infty$ bounds $\max_{t \le T}\|\mathbf{w}^* - \mathbf{w}_t\|_\infty$. Next, we analyze the dependence of the gradient on the features of the data. Since

 $(g_{t,i})^2 \le \Big[\lambda w_{t,i} + \frac{\sum_{j=1}^{t-1}\big(1 - y_t\langle\mathbf{x}_t - \mathbf{x}_j, \mathbf{w}\rangle\big)\, y_t\,(x_{j,i} - x_{t,i})}{T_t^-}\Big]^2 \le 2(\lambda w_{t,i})^2 + 2C(x_{j,i} - x_{t,i})^2 = 2(\lambda w_{t,i})^2 + 2C(r_{t,i})^2,$

where $C$ is a constant bounding the scalar factor of the second term on the right-hand side, and $r_{t,i} = \max_{j<t}|x_{j,i} - x_{t,i}|$, we have

 $\sum_{i=1}^{d}\|\mathbf{g}_{1:T,i}\|_2 = \sum_{i=1}^{d}\sqrt{\sum_{t=1}^{T}(g_{t,i})^2} \le \sqrt{2}\sum_{i=1}^{d}\sqrt{\sum_{t=1}^{T}\big[(\lambda w_{t,i})^2 + C(r_{t,i})^2\big]}.$

Finally, combining the above inequalities, we arrive at

 $\sum_{t=1}^{T}\big[\mathcal{L}_t(\mathbf{w}_t) - \mathcal{L}_t(\mathbf{w}^*)\big] \le 2 D_\infty \sum_{i=1}^{d}\sqrt{\sum_{t=1}^{T}\big[(\lambda w_{t,i})^2 + C(r_{t,i})^2\big]}.$ ∎

From the proof above, we can conclude that Algorithm 1 should have lower regret than non-adaptive algorithms due to its dependence on the geometry of the underlying data space. If the features are normalized and sparse, the gradient terms in the bound should be much smaller than $\sqrt{T}$, which leads to lower regret and faster convergence. If the feature space is relatively dense, then the regret will be $O(\sqrt{T})$ in the general case, as for the OPAUC and OAM methods.

### 4.2 Regret Bounds with High-dimensional Sparse Data

###### Theorem 2.

Assume $\|\mathbf{x}_t\| \le 1$ and that the diameter of the feasible set $\{\mathbf{w}: \|\mathbf{w}\| \le 1/\sqrt{\lambda}\}$ is bounded via $\max_{t \le T}\|\mathbf{w}^* - \mathbf{w}_t\|_\infty \le D_\infty$. Then the regret bound with respect to the $\ell_1$ regularization term is

 $\sum_{t=1}^{T}\big[\mathcal{L}_t(\mathbf{w}_t) + \theta\|\mathbf{w}_t\|_1 - \mathcal{L}_t(\mathbf{w}^*) - \theta\|\mathbf{w}^*\|_1\big] \le 2 D_\infty \sum_{i=1}^{d}\sqrt{\sum_{t=1}^{T}\big[(\lambda w_{t,i})^2 + C(r_{t,i})^2\big]},$

where $C$ is a constant bounding the pairwise loss term in the gradient, $r_{t,i} = \max_{j<t}|x_{j,i} - x_{t,i}|$, and $\eta = D_\infty/\sqrt{2}$.

###### Proof.

In the case of high-dimensional sparse adaptive online AUC maximization, the regret we plan to bound with respect to the optimal weight $\mathbf{w}^*$ is formulated as

 $R(T) = \sum_{t=1}^{T}\big[\mathcal{L}_t(\mathbf{w}_t) + \varphi(\mathbf{w}_t) - \mathcal{L}_t(\mathbf{w}^*) - \varphi(\mathbf{w}^*)\big],$

where $\varphi(\mathbf{w}) = \theta\|\mathbf{w}\|_1$ is the regularization term that imposes sparsity on the solution. Similarly, if we denote $\psi_t(\mathbf{w}) = \frac{1}{2}\langle\mathbf{w}, H_t\mathbf{w}\rangle$ and the dual norm of $\|\cdot\|_{\psi_t}$ by $\|\cdot\|_{\psi_t^*}$, in which case $\|\mathbf{g}\|_{\psi_t^*} = \|\mathbf{g}\|_{H_t^{-1}}$, it is easy to check that the updating rule of SAdaOAM,

 $w_{t+1,i} = \operatorname{sign}\!\Big(w_{t,i} - \frac{\eta}{H_{t,ii}}g_{t,i}\Big)\Big[\Big|w_{t,i} - \frac{\eta}{H_{t,ii}}g_{t,i}\Big| - \frac{\eta\theta}{H_{t,ii}}\Big]_+,$

is the same as the following one:

 $\mathbf{w}_{t+1} = \arg\min_{\|\mathbf{w}\| \le 1/\sqrt{\lambda}}\big\{\eta\langle\mathbf{g}_t, \mathbf{w}\rangle + \eta\varphi(\mathbf{w}) + B_{\psi_t}(\mathbf{w}, \mathbf{w}_t)\big\}, \qquad (12)$

where $\varphi(\mathbf{w}) = \theta\|\mathbf{w}\|_1$. From [10], we have

 $R(T) \le \frac{1}{2\eta}\max_{t \le T}\|\mathbf{w}^* - \mathbf{w}_t\|_\infty^2 \sum_{i=1}^{d}\|\mathbf{g}_{1:T,i}\|_2 + \eta\sum_{i=1}^{d}\|\mathbf{g}_{1:T,i}\|_2.$

Furthermore, assuming $\max_{t \le T}\|\mathbf{w}^* - \mathbf{w}_t\|_\infty \le D_\infty$ and setting $\eta = D_\infty/\sqrt{2}$, the final regret bound is

 $R(T) \le \sqrt{2}\, D_\infty \sum_{i=1}^{d}\|\mathbf{g}_{1:T,i}\|_2.$

This theoretical result shows that the regret bound for the sparse solution is the same as in the case where $\varphi(\mathbf{w}) = 0$. ∎

As discussed above, the SAdaOAM algorithm should have a lower regret bound than non-adaptive algorithms on high-dimensional sparse data, though this depends on the geometric properties of the underlying feature distribution. If some features appear much more frequently than others, the bound indicates that we could attain remarkably lower regret by using higher learning rates for infrequent features and lower learning rates for frequently occurring features.

## 5 Experimental Results

In this section, we evaluate the proposed set of AdaOAM algorithms in terms of AUC performance and convergence rate, and examine their parameter sensitivity. The main framework of the experiments is based on LIBOL, an open-source library for online learning algorithms [32].

### 5.1 Comparison Algorithms

We conduct comprehensive empirical studies by comparing the proposed algorithms with various AUC optimization algorithms for both online and batch scenarios. Specifically, the algorithms considered in our experiments include:

• Online Uni-Exp: An online learning algorithm which optimizes the (weighted) univariate exponential loss [33];

• Online Uni-Log: An online learning algorithm which optimizes the (weighted) univariate logistic loss [33];

• OAM-seq: The OAM algorithm with reservoir sampling and the sequential updating method [6];

• OAM-gra: The OAM algorithm with reservoir sampling and the online gradient updating method [6];

• OPAUC: The one-pass AUC optimization algorithm with square loss function [7];

• SVM-perf: A batch algorithm which directly optimizes AUC [4];

• CAPO: A batch algorithm which trains nonlinear auxiliary classifiers first and then adapts auxiliary classifiers for specific performance measures [34];

• Batch Uni-Log: A batch algorithm which optimizes the (weighted) univariate logistic loss [33];

• Batch Uni-Squ: A batch algorithm which optimizes the (weighted) univariate square loss;