Robust Online Multi-Task Learning with Correlative and Personalized Structures

Peng Yang, Peilin Zhao, Xin Gao

P. Yang and X. Gao are with the Computer, Electrical and Mathematical Sciences & Engineering Division at King Abdullah University of Science and Technology, Saudi Arabia. E-mail: {peng.yang.2, xin.gao}@kaust.edu.sa
P. Zhao is a senior algorithm expert in Ant Financial Service Group, China. E-mail: peilinzhao@hotmail.com
X. Gao and P. Zhao are corresponding authors.

Abstract

Multi-Task Learning (MTL) can enhance a classifier’s generalization performance by learning multiple related tasks simultaneously. Conventional MTL works under the offline or batch setting, and suffers from expensive training cost and poor scalability. To address such inefficiency issues, online learning techniques have been applied to solve MTL problems. However, most existing algorithms of online MTL constrain task relatedness into a presumed structure via a single weight matrix, which is a strict restriction that does not always hold in practice. In this paper, we propose a robust online MTL framework that overcomes this restriction by decomposing the weight matrix into two components: the first one captures the low-rank common structure among tasks via a nuclear norm and the second one identifies the personalized patterns of outlier tasks via a group lasso. Theoretical analysis shows the proposed algorithm can achieve a sub-linear regret with respect to the best linear model in hindsight. Even though the above framework achieves good performance, the nuclear norm, which simply adds all nonzero singular values together, may not be a good low-rank approximation. To improve the results, we use a log-determinant function as a non-convex rank approximation. A gradient scheme is applied to optimize the log-determinant function, and it admits a closed-form solution for this refined problem. Experimental results on a number of real-world applications verify the efficacy of our method.

Keywords

artificial intelligence, learning systems, online learning, multitask learning, classification.

1 Introduction

Multi-Task Learning (MTL) aims to enhance the overall generalization performance by learning multiple related tasks simultaneously. It has been extensively studied from various points of view [1, 2, 3, 4]. As an example, the common tastes of users (i.e., tasks) with respect to movies (i.e., instances) can be harnessed into a movie recommender system using MTL [5]. Most MTL methods run under the offline learning setting where the training data for each task is available beforehand. However, offline learning methods are generally inefficient, since they suffer from a high training cost and poor scalability. This is especially true when it comes to the large-scale streaming data. As a remedy, MTL has been studied under the online setting, in which the model runs over a sequence of data by processing them one by one [6]. After updating the model in each round, the current input will be discarded. As a result, online learning algorithms are efficient and scalable, and have been successfully applied to a number of MTL applications [7, 8, 9, 10, 11].

In this paper, we investigate MTL under the online setting. Existing online MTL methods assume that all tasks are related to each other and simply constrain their relationships via a presumed structure [7, 12]. However, such a constraint may be too restrictive and rarely holds in real-life applications, as personalized tasks with individual traits often exist [13]. We address this drawback through a new formulation of online MTL that consists of two components: the first component captures a low-rank correlative structure over the related tasks, while the second one represents the personalized patterns of individual tasks.

Specifically, our algorithm learns a weight matrix which is decomposed into two components as aforementioned. A nuclear norm regularization is imposed on the first component to induce a low-rank correlative structure of the related tasks. A group lasso penalty is applied onto the second component of all individual tasks to identify the outliers. Next, we apply an online projected gradient scheme to solve this non-smooth problem with a closed-form solution for the correlative and personalized components. This gives our algorithm two advantages: 1) it is efficient to make predictions and update models in a real-time manner; 2) it can achieve a good trade-off between the common and personalized structures. We provide a theoretical evaluation for our algorithm by giving a proof that our algorithm can achieve a sub-linear regret compared to the best linear model in hindsight.

Although our algorithm achieves good performance, it may not accurately approximate a low-rank matrix: the nuclear norm is essentially the $\ell_1$ norm of the singular values, which is known to be biased in estimation since large singular values are detrimental to the approximation. To address this issue, we use a log-determinant function to approximate the matrix rank, which is able to reduce the contributions of large singular values while keeping those of small singular values close to zero. To solve this non-convex optimization problem, a proximal gradient algorithm is derived to adaptively learn such a low-rank structure with a closed-form solution. In addition, we prove that there is a unique root of the refined objective under a proper parameter setting. Finally, we conduct comparative experiments against a variety of state-of-the-art techniques on three real-world datasets. Empirically, the refined algorithm with the log-determinant function achieves better performance than that with the nuclear norm due to a better low-rank approximation.

The rest of this paper is organized as follows. Section 2 introduces related work. The problem setting and the proposed algorithm are presented in Section 3 and Section 4, respectively. Section 5 gives the theoretical analysis and Section 6 provides experimental results. The final section concludes this paper.

2 Related Work

In this section, we briefly introduce works related to MTL in the offline and online settings, followed by the low-rank matrix approximation.

Multi-Task Learning

Conventional offline or batch MTL algorithms can be broadly classified into the following two categories: explicit parameter sharing and implicit parameter sharing. In the first category, all tasks can be made to share some common parameters explicitly. Such common parameters include hidden units in neural networks [14], priors in hierarchical Bayesian models [15, 16], feature mapping matrices [17] and classification weights [18]. In the second category, the shared structure is estimated implicitly by imposing a low-rank subspace [19, 20], e.g. Trace-norm Regularized Multi-task Learning (TRML) [21] captured a common low-dimensional subspace of task relationship with a trace-norm regularization; or a common set of features [22, 23], e.g. Multi-Task Feature Learning (MTFL) [24] learned a common feature across the tasks in an unsupervised manner. Besides, [25] and [26] proposed a few experts to learn the task relatedness on the entire task set. These MTL techniques have been successfully used in real-world applications, e.g. multi-view action recognition [27], spam detection [28], head pose estimation [29], etc.

Compared to offline learning, online learning techniques are more efficient and better suited to handling massive and sequential data [30, 31, 32]. An early work [33, 34], Online Multi-Task Learning (OMTL), studied online learning of multiple tasks in parallel. It exploited the task structure by using a global loss function. Another work [35] proposed a collaborative online framework, Confidence-weighted Collaborative Online Multi-task Learning (CW-COL), which learned the task relatedness by combining the individual and global variations of online Passive-Aggressive (PA) algorithms [36]. Instead of fixing the task relationship via a presumed structure [12], a recent Online Multi-Task Learning approach introduced an adaptive interaction matrix which quantified the task relevance with LogDet Divergence (OMTLLOG) and von-Neumann Divergence (OMTLVON) [7], respectively. Most recently, [37] proposed an algorithm, Shared Hypothesis model (SHAMO), which used a K-means-like procedure to cluster different tasks in order to learn the shared hypotheses. Similar to SHAMO, [38] proposed Online Smoothed Multi-Task Learning with Exponential updates (OSMTL-e). It jointly learned both the per-task model parameters and the inter-task relationships in an online MTL setting. The algorithm presented in this paper differs from existing ones in that it can learn both a common structure among the correlative tasks and the individual structure of outlier tasks.

Low-Rank Matrix Approximation

In many areas (e.g. machine learning, signal and image processing), high-dimensional data are commonly used. Rather than being uniformly distributed, high-dimensional data often lie on low-dimensional structures. Recovering the low-dimensional subspace can well preserve and reveal the latent structure of the data. For example, face images of an individual under different lighting conditions span a low-dimensional subspace of an ambient high-dimensional space [39]. To learn low-dimensional subspaces, recently proposed methods, such as Low-Rank Representation (LRR) [40] and Low-Rank Subspace Clustering (LRSC) [41], usually depend on the nuclear norm as a convex rank approximation function to seek low-rank subspaces. Unlike the rank function, which treats all nonzero singular values equally, the nuclear norm simply adds all nonzero singular values together, so the large values may dominate the approximation and make it deviate considerably from the true rank. To resolve this problem, we propose a log-determinant function to approximate the rank function, which is able to reduce the contributions of large singular values while keeping those of small singular values close to zero. To the best of our knowledge, this is the first work that exploits a log-determinant function to learn a low-rank structure of task relationship in the online MTL problem.

3 Problem Setting

In this section, we first describe our notations, followed by the problem setting of online MTL.

3.1 Notations

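Lowercase letters are used as scalars, lowercase bold letters as vectors, uppercase letters as elements of a matrix, and bold-face uppercase letters as matrices. $\mathbf{a}_i$ and $A_{ij}$ denote the $i$-th column and the $(i,j)$-th element of a matrix $\mathbf{A}$, respectively. Euclidean and Frobenius norms are denoted by $\|\cdot\|$ and $\|\cdot\|_F$, respectively. In particular, for every $q, p \geq 1$, we define the $(q,p)$-norm of $\mathbf{A} \in \mathbb{R}^{d \times m}$ as $\|\mathbf{A}\|_{q,p} = \big(\sum_{i=1}^{m} \|\mathbf{a}_i\|_q^p\big)^{1/p}$. When the function $f$ is differentiable, we denote its gradient by $\nabla f$.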

3.2 Problem Setting

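According to the online MTL setting, we are faced with $m$ different but related classification problems, also known as tasks. Each task has a sequence of instance-label pairs, i.e., $\{(\mathbf{x}_t^i, y_t^i)\}_{1 \le t \le T}$, where $\mathbf{x}_t^i \in \mathbb{R}^d$ is a feature vector drawn from a single feature space shared by all tasks, and $y_t^i \in \{-1, +1\}$. The algorithm maintains $m$ separate models in parallel, one for each of the $m$ tasks. At round $t$, $m$ instances are presented at one time. Given the $i$-th task instance $\mathbf{x}_t^i$, the algorithm predicts its label using a linear model $\mathbf{w}_t^i$, i.e., $\hat{y}_t^i = \mathrm{sign}(p_t^i)$, where $p_t^i = \mathbf{w}_t^{i\top} \mathbf{x}_t^i$ and $\mathbf{w}_t^i$ is the weight parameter at round $t$. The true label $y_t^i$ is not revealed until then. A hinge-loss function is applied to evaluate the prediction,

$$f_t^i(\mathbf{w}_t^i) = \big[1 - y_t^i\, \mathbf{w}_t^{i\top} \mathbf{x}_t^i\big]_+ \tag{1}$$

where $[a]_+ = \max\{0, a\}$. The cumulative loss over all $m$ tasks at round $t$ is defined as

$$F_t(\mathbf{W}_t) = \sum_{i=1}^{m} f_t^i(\mathbf{w}_t^i) \tag{2}$$

where $\mathbf{W}_t = [\mathbf{w}_t^1, \ldots, \mathbf{w}_t^m] \in \mathbb{R}^{d \times m}$ is the weight matrix for all tasks. Inspired by the Regularized Loss Minimization (RLM) in which one minimizes an empirical loss plus a regularization term jointly [42], we formulate our online MTL to minimize the regret compared to the best linear model in hindsight,

$$R_\phi = \sum_{t=1}^{T} \big[F_t(\mathbf{W}_t) + g(\mathbf{W}_t)\big] - \inf_{\mathbf{W} \in \Omega} \sum_{t=1}^{T} \big[F_t(\mathbf{W}) + g(\mathbf{W})\big]$$

where $\Omega$ is a closed convex subset and the regularizer $g$ is a convex regularization function that constrains $\Omega$ into simple sets, e.g. hyperplanes, balls, bound constraints, etc. For instance, $g(\mathbf{W}) = \|\mathbf{W}\|_1$ constrains $\mathbf{W}$ into a sparse matrix.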
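
As a minimal illustration of the per-task hinge loss and the cumulative loss over tasks described above, the following NumPy sketch evaluates both for one round. It assumes labels in {-1, +1} and one linear model per task stored as a column of the weight matrix; the function names `hinge_losses` and `cumulative_loss` are hypothetical and not taken from the paper.

```python
import numpy as np

def hinge_losses(W, X, y):
    """Per-task hinge losses for one round.

    W: (d, m) weight matrix, column i is the model of task i.
    X: (d, m) matrix, column i is the instance received by task i.
    y: (m,) labels in {-1, +1}.
    """
    margins = y * np.sum(W * X, axis=0)  # y_i * <w_i, x_i> for each task
    return np.maximum(0.0, 1.0 - margins)

def cumulative_loss(W, X, y):
    """Sum of the per-task hinge losses in this round."""
    return float(hinge_losses(W, X, y).sum())

# Two tasks, two features (toy values chosen for illustration only).
W = np.array([[1.0, 0.0], [0.0, 1.0]])
X = np.array([[2.0, 0.5], [0.0, 0.5]])
y = np.array([1.0, -1.0])
losses = hinge_losses(W, X, y)
total = cumulative_loss(W, X, y)
```

In this toy round the first task is classified with a margin of at least one and incurs no loss, while the second task is misclassified and contributes all of the cumulative loss.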

Fig. 1: Learning Personalized and Low-rank Structures from Multiple Tasks

4 Algorithm

We propose to solve the regret by two steps: 1) to learn the correlative and personalized patterns over multiple tasks; 2) to achieve an optimal solution for the regret .

4.1 Correlative and Personalized Structures

We propose a novel formulation for online MTL that incorporates two components, as illustrated in Fig. 1. The first component, $\mathbf{U}$, captures a low-rank common structure over the similar tasks, where one model (or pattern) can be shared across the related tasks. As outlier tasks often exist in real-world scenarios, the second one, $\mathbf{V}$, identifies the personalized patterns specific to individual tasks. Thus, incorporating the two structures $\mathbf{U}$ and $\mathbf{V}$ could make the final model more robust and reliable.

To learn both correlative and personalized structures from multiple tasks, we decompose the weight matrix $\mathbf{W}$ into two components: correlative matrix $\mathbf{U}$ and personalized matrix $\mathbf{V}$, and define a new weight matrix,

(3)

where is the -th column of the weight matrix . Denoted by matrix the summation of and , we obtain

(4)

where is an identity matrix. Given an instance , the algorithm makes prediction based on both the correlative and personalized parameters,

(5)

with the corresponding loss function,

We thus can reformat the cumulative loss function with respect to ,

(6)

We impose a regularizer on and , respectively,

(7)

where and are non-negative trade-off parameters. Substituting Eq. (6) and (7) into the regret , it can be formatted as

(8)

where is a non-smooth convex function. We next show how to achieve an optimal solution to the reformatted regret (8).

4.2 Online Task Relationship Learning

Inspired by [43], we can solve the regret (8) by a subgradient projection,

(9)

where is the learning rate. In the following lemma, we show that the problem (9) can be turned into a linearized version of the proximal algorithm [44]. To do so, we first introduce a Bregman-like distance function [45],

where is a differentiable and convex function.

Lemma 1

Assume , then using first-order Taylor expansion of , the algorithm (9) is equivalent to a linearized form with a step-size parameter ,

Instead of balancing this trade-off individually for each of the multiple tasks, we balance it for all the tasks jointly. However, the subgradient of a composite function, i.e. cannot lead to a desirable effect, since we should constrain the projected gradient (i.e. ) into a restricted set. To address this issue, we refine the optimization function by adding a regularizer on ,

(10)

Note that the formulation (10) is different from the Mirror Descent (MD) algorithm [46], since we do not linearize the regularizer .

Given that , we show that the problem (10) can be presented with and in the lemma below.

Lemma 2

Assume that and , the problem (10) turns into an equivalent form in terms of and ,

(11)

where the parameters and control previous learned knowledge retained by and .

Proof.

Assume that , we obtain:

The linearized gradient form can be rewritten as:

Substituting above two inferences and (7) into problem (10), we complete this proof. We next introduce the regularization and , and then present how to optimize this non-smooth convex problem with a closed-form solution.

4.3 Regularization

As mentioned above, restricting task relatedness to a presumed structure via a single weight matrix [7] is too strict and not always plausible in practical applications. To overcome this problem, we impose a regularizer on $\mathbf{U}$ and $\mathbf{V}$ as follows,

(12)

A nuclear norm [19] is imposed on $\mathbf{U}$ (i.e., $\|\mathbf{U}\|_*$) to represent the multiple tasks by a small number of basis vectors. Intuitively, a model performing well on one task is likely to perform well on similar tasks. Thus, we expect that the best model can be shared across several related tasks. However, the assumption that all tasks are correlated may not hold in real applications. Thus, we impose the $(2,1)$-norm [47] on $\mathbf{V}$ (i.e., $\|\mathbf{V}\|_{2,1}$), which favors a few non-zero columns in the matrix $\mathbf{V}$ to capture the personalized tasks.
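
To make the two regularizers concrete, the sketch below computes a nuclear norm (sum of singular values) and a column-wise group norm (sum of the Euclidean norms of the columns, one column per task). The matrices and the helper names `nuclear_norm` and `l21_norm` are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

def nuclear_norm(U):
    """Sum of the singular values of U."""
    return float(np.linalg.svd(U, compute_uv=False).sum())

def l21_norm(V):
    """Sum over columns (tasks) of the Euclidean norm of each column."""
    return float(np.linalg.norm(V, axis=0).sum())

# A rank-1 "shared" component: every task uses the same direction.
U = np.outer([1.0, 2.0, 2.0], [1.0, 1.0, 1.0, 1.0])

# A column-sparse "personalized" component: only task 3 deviates.
V = np.zeros((3, 4))
V[:, 2] = [3.0, 0.0, 4.0]

shared_penalty = nuclear_norm(U)
personal_penalty = l21_norm(V)
```

A small nuclear norm relative to the matrix size indicates that the columns of the shared component lie close to a low-dimensional subspace, whereas the group norm only pays for the columns that are non-zero, which is why it tends to single out a few outlier tasks.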

Note that our algorithm with the regularization terms above is able to detect personalized patterns, unlike the algorithms [25, 26, 48]. Although prior work [13] considers detecting the personalized task, it was designed for the offline setting, which is different from our algorithm since we learn the personalized pattern adaptively with online techniques.

4.3.1 Optimization

Although the composite problem (11) can be solved by [49], the composite function with linear constraints has not been investigated to solve the MTL problem. We employ a projected gradient scheme [50, 51] to optimize this problem with both smooth and non-smooth terms. Specifically, by substituting (12) into (11) and omitting the terms unrelated to $\mathbf{U}$ and $\mathbf{V}$, the problem can be rewritten as a projected gradient scheme,

where

Due to the decomposability of the objective function above, the solutions for $\mathbf{U}$ and $\mathbf{V}$ can be optimized separately,

(13)
(14)

This has two advantages: 1) there is a closed-form solution for each update; 2) the updates for $\mathbf{U}$ and $\mathbf{V}$ can be performed in parallel.

Computation of U: Inspired by [50], we show that the optimal solution to (13) can be obtained via solving a simple convex optimization problem in the following theorem.

Theorem 1

Denote by the eigendecomposition of where rank, , , and diag. Let be the solution of the following problem,

(15)

It is easy to obtain the optimal solution for (15): for . Assume that diag, the optimal solution to Eq. (13) is given by,

(16)
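
The closed-form update of the low-rank component amounts to shrinking singular values. As a hedged sketch, the standard singular value soft-thresholding operator can be written as follows; the threshold `tau` is a placeholder for whatever combination of step size and trade-off parameter appears in Eq. (15), which is not recoverable here, and the function name is hypothetical.

```python
import numpy as np

def singular_value_threshold(U_hat, tau):
    """Soft-threshold the singular values of U_hat by tau (assumed threshold)."""
    P, s, Qt = np.linalg.svd(U_hat, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)  # singular values below tau vanish
    return (P * s_shrunk) @ Qt           # rebuild the matrix from shrunken values

U_hat = np.diag([3.0, 1.0])
U_next = singular_value_threshold(U_hat, tau=1.5)
```

Singular values smaller than the threshold are set to zero, so the operator lowers the rank of its input whenever some directions carry little energy.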

Computation of V: We rewrite (14) by solving an optimization problem for each column,

(17)

where denotes the -th column of . The optimal operator problem (17) above admits a closed-form solution with time complexity of  [52],

(18)

We observe that would be retained if , otherwise, it decays to 0. Hence, we infer that only the personalized patterns among the tasks, which differ from the low-rank common structures and thus cannot be captured by $\mathbf{U}$, would be retained in $\mathbf{V}$.
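
The retain-or-decay behavior described above matches the standard column-wise (group) soft-thresholding operator. The sketch below is an illustration under that assumption; `tau` again stands in for the threshold in Eq. (18), and `column_shrink` is a hypothetical name.

```python
import numpy as np

def column_shrink(V_hat, tau):
    """Shrink every column of V_hat toward zero by tau in Euclidean norm.

    Columns whose norm is at most tau are set to zero; the others keep
    their direction and lose tau from their length.
    """
    norms = np.maximum(np.linalg.norm(V_hat, axis=0), 1e-12)  # avoid 0-division
    scale = np.maximum(0.0, 1.0 - tau / norms)
    return V_hat * scale

V_hat = np.array([[3.0, 0.3],
                  [4.0, 0.4]])
V_next = column_shrink(V_hat, tau=1.0)
```

Here the first column (norm 5) survives with reduced length, while the second column (norm 0.5) is removed entirely, so only tasks with a sufficiently strong personalized signal keep a non-zero column.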

The two quantities $\mathbf{U}$ and $\mathbf{V}$ can be updated according to a closed-form solution on each round $t$. A mistake-driven strategy is used to update the model. Finally, this algorithm, which we call Robust Online Multi-tasks learning under Correlative and persOnalized structures with NuClear norm term (ROMCO-NuCl), is presented in Alg. 1.

Algorithm 1 ROMCO-NuCl
  Input: a sequence of instances , and the parameter , , and .
  Initialize: for ;
  for  do
    for  do
      Receive instance pair ( );
      Predict ;
      Compute the loss function ;
    end for
    if  then
      Update $\mathbf{U}_{t+1}$ with Eq. (16);
      Update $\mathbf{V}_{t+1}$ with Eq. (18);
    else
      $\mathbf{U}_{t+1} = \mathbf{U}_t$ and $\mathbf{V}_{t+1} = \mathbf{V}_t$;
    end if
  end for
  Output: for
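
The overall flow of Alg. 1 can be sketched as one function per round. This is a simplified illustration, not the exact update of the paper: it takes a plain subgradient step on the summed hinge loss for both components and then applies singular value thresholding and column-wise shrinkage. The step size `eta`, the thresholds `tau_u` and `tau_v`, and all function names are assumptions, and the additional parameters of Lemma 2 are omitted.

```python
import numpy as np

def svt(A, tau):
    """Soft-threshold the singular values of A by tau."""
    P, s, Qt = np.linalg.svd(A, full_matrices=False)
    return (P * np.maximum(s - tau, 0.0)) @ Qt

def column_shrink(A, tau):
    """Shrink each column of A toward zero by tau in Euclidean norm."""
    norms = np.maximum(np.linalg.norm(A, axis=0), 1e-12)
    return A * np.maximum(0.0, 1.0 - tau / norms)

def online_round(U, V, X, y, eta, tau_u, tau_v):
    """One round; column i of X and entry i of y belong to task i."""
    scores = np.sum((U + V) * X, axis=0)        # per-task linear scores
    y_pred = np.where(scores >= 0, 1.0, -1.0)   # predicted labels
    losses = np.maximum(0.0, 1.0 - y * scores)  # per-task hinge losses
    if not np.any(losses > 0):                  # no loss: keep the model
        return U, V, y_pred
    G = -(X * y) * (losses > 0)                 # hinge subgradient per task
    U_new = svt(U - eta * G, tau_u)             # low-rank component
    V_new = column_shrink(V - eta * G, tau_v)   # personalized component
    return U_new, V_new, y_pred

# Toy run: two tasks, three features, thresholds set to zero for clarity.
X = np.array([[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]])
y = np.array([1.0, -1.0])
U0, V0 = np.zeros((3, 2)), np.zeros((3, 2))
U1, V1, pred1 = online_round(U0, V0, X, y, eta=0.5, tau_u=0.0, tau_v=0.0)
U2, V2, pred2 = online_round(U1, V1, X, y, eta=0.5, tau_u=0.0, tau_v=0.0)
```

Because the two proximal steps act on separate matrices, they could be run in parallel; with positive thresholds the first step would lower the rank of the shared component and the second would zero out weak personalized columns.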

4.4 Log-Determinant Function

Fig. 2: The rank, nuclear norm and log-determinant objectives in the scalar case

While the nuclear norm has been theoretically proven to be the tightest convex approximation to the rank function, it is usually difficult to theoretically prove whether the nuclear norm near-optimally approximates the rank function, e.g., the incoherence property [53, 54]. In addition, the nuclear norm may not accurately approximate the rank function in practice, since the matrix "rank" regards all nonzero singular values as having equal contributions, i.e., it regards all positive singular values as "1", as shown by the red line in Fig. 2; while the nuclear norm, as shown by the yellow star line in Fig. 2, treats the nonzero singular values differently, i.e., it simply adds all nonzero values together, so the larger singular values contribute more to the approximation.

To solve this issue, we introduce a log-determinant function as follows,

Definition 1

Let () be the singular values of the matrix ,

  • When , the term , which is the same as the true rank function;

  • When , , implying that those small singular values can be reduced further;

  • For those large singular values , , which is a significant reduction over large singular values.

In this case, approximates the rank function better than the nuclear norm by significantly reducing the weights of large singular values, meanwhile regarding those very small singular values as noise, as presented with the blue circle line in Fig. 2.
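
The contrast between the three rank measures can be checked numerically. The sketch below assumes the log-determinant surrogate takes the form $\sum_i \log(1 + \sigma_i^2)$, which is consistent with the cubic equation that arises later in this section but is an assumption about the exact definition; `rank_surrogates` is a hypothetical helper.

```python
import numpy as np

def rank_surrogates(A, tol=1e-10):
    """Return the rank, the nuclear norm and an assumed log-det surrogate of A."""
    s = np.linalg.svd(A, compute_uv=False)
    return {
        "rank": int(np.sum(s > tol)),             # counts nonzero singular values
        "nuclear": float(s.sum()),                # adds singular values linearly
        "logdet": float(np.log1p(s ** 2).sum()),  # sum of log(1 + sigma^2)
    }

# One large, one moderate and one tiny singular value.
A = np.diag([100.0, 1.0, 0.01])
vals = rank_surrogates(A)
```

For this matrix the nuclear norm is dominated by the single large singular value, whereas the logarithmic surrogate compresses it and treats the tiny singular value as nearly zero, which is the behavior the definition above is meant to capture.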

4.4.1 Optimal Solution

Replacing the nuclear norm with the log-determinant function , the minimization of is reduced to the following problem:

(19)

To solve the objective function above, we show that the optimal solution could be obtained by solving the roots of a cubic equation in the following theorem.

Theorem 2

Let where diag and rank. Let be the solution of the following problem,

(20)

Then the optimal solution to Eq. (19), similar to Thm. 1, is given by , where diag and is the optimal solution of (20) . To obtain the solution, the problem is reduced to solving the derivative of Eq. (20) for each with ,

(21)

Algorithm 2 ROMCO-LogD
  Input: a sequence of instances , and the parameter , , and .
  Initialize: for ;
  for  do
    for  do
      Receive instance pair ( );
      Predict ;
      Compute the loss function ;
    end for
    if  then
      Update $\mathbf{U}_{t+1}$ by solving Eq. (21);
      Update $\mathbf{V}_{t+1}$ with Eq. (18);
    else
      $\mathbf{U}_{t+1} = \mathbf{U}_t$ and $\mathbf{V}_{t+1} = \mathbf{V}_t$;
    end if
  end for
  Output: for

In general, the equation (21) has three roots. The details of the root computation are given in the Appendix. In addition, in the following proposition, we prove that a unique positive root for (21) can be obtained in a certain parameter setting.

Proposition 1

Assume that with , when , , under the condition that and , located in is the unique positive root in cubic Eq. (21).

Proof.

We need to minimize Eq. (20) under the constraint of . The derivative of is

and the second derivative is

Case 1: If , because , we have . That is, is nondecreasing for any and strictly increasing with . Thus, the minimizer of is .
Case 2: If , then the roots exist only in a region of to let , since is monotonic with , and while .

  • If , then since . In this case, is a strictly convex function with a unique root in . Thus, the proposition is proven.

  • If , we determine the minimizer in the following way: Denote the set of positive roots of Eq. (21) by . By the first-order necessary optimality condition, the minimizer needs to be chosen from , that is, .

In our experiments, we initialize and increase its value in each iteration. Therefore, when , the minimizer is the unique positive root of (21); when , .
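
The root selection discussed in the proof can be illustrated on a scalar problem. The sketch below assumes the per-singular-value objective has the form $h(\sigma) = \tfrac{1}{2}(\sigma - a)^2 + \beta \log(1 + \sigma^2)$ with $\sigma \geq 0$, whose stationary points satisfy the cubic $\sigma^3 - a\sigma^2 + (1 + 2\beta)\sigma - a = 0$; the exact constants of Eq. (20) are not recoverable here, so `a`, `beta` and the function name `logdet_shrink` are illustrative.

```python
import numpy as np

def logdet_shrink(a, beta):
    """Minimize h(s) = 0.5*(s - a)**2 + beta*log(1 + s**2) over s >= 0."""
    def h(s):
        return 0.5 * (s - a) ** 2 + beta * np.log1p(s ** 2)

    # Setting h'(s) = 0 and multiplying by (1 + s^2) gives a cubic in s.
    roots = np.roots([1.0, -a, 1.0 + 2.0 * beta, -a])
    # Candidates: the boundary point 0 and every positive real root.
    candidates = [0.0] + [r.real for r in roots
                          if abs(r.imag) < 1e-9 and r.real > 0]
    return min(candidates, key=h)

s_zero = logdet_shrink(0.0, 0.5)  # input singular value is zero
s_pos = logdet_shrink(3.0, 0.5)   # input singular value is positive
```

Comparing the objective at all candidate points mirrors the first-order argument in the proof: when the input is zero the minimizer stays at zero, and otherwise the minimizer is a positive root that lies below the input value.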

We are ready to present the algorithm: ROMCO with Log-Determinant function for rank approximation, namely ROMCO-LogD, which also exploits a mistake-driven update rule. We summarize ROMCO-LogD in Alg. 2. To the best of our knowledge, this is the first work that proposes a log-determinant function to learn a low-rank structure of task relationship in the online MTL problem. In the next section, we will theoretically analyze the performance of the proposed online MTL algorithms ROMCO-NuCl/LogD.

5 Theoretical Analysis

We next evaluate the performance of our online algorithm ROMCO-NuCl/LogD in terms of the regret bound. We first show the regret bound of the algorithm (10) and its equivalent form in the following lemma, which is essentially the same as Theorem 2 in the paper [55]:

Lemma 3

Let be updated according to (10). Assume that is -strongly convex w.r.t. a norm and its convex conjugate with , then for any ,

Remark 1

We show that the above regret is with respect to the best linear model in hindsight. Suppose that the functions are Lipschitz continuous, then such that . Then we obtain:

We also assume that . Then by setting , we have . Given that is constant, setting , we have .

Lemma 4

The general optimization problem (10) is equivalent to the two step process of setting:

Proof.

The optimal solution to the first step satisfies , so that

(22)

Then look at the optimal solution for the second step. For some , we have

(23)

Substituting Eq. (22) into Eq. (23), we obtain

which satisfies the optimal solution to the one-step update of (10).

We next show that ROMCO can achieve a sub-linear regret in the following theorem.

Theorem 3

The algorithm ROMCO (Alg. 1 and Alg. 2) runs over a sequence of instances for each of the tasks. Assume that , i.e., and , , then the following inequality holds for all ,

Proof.

Let , according to Lemma 4, the solutions in subgradient projection (13) and (14) are equivalent to the one in form of the general optimization (10). Based on Lemma 3, for any ,

Because that (i.e., ), . Assuming , we obtain and . Thus,

By setting , we have .

6 Experimental Results

We evaluate the performance of our algorithm on three real-world datasets. We start by introducing the experimental data and benchmark setup, followed by discussions on the results of three practical applications.

6.1 Data and Benchmark Setup

| | Spam Email | MHC-I | EachMovie |
|---|---|---|---|
| #Tasks | 4 | 12 | 30 |
| #Sample | 7068 | 18664 | 6000 |
| #Dimension | 1458 | 400 | 1783 |
| #MaxSample | 4129 | 3793 | 200 |
| #MinSample | 710 | 415 | 200 |

TABLE I: Statistics of three datasets

6.1.1 Experimental Datasets and Baseline

We used three real-world datasets to evaluate our algorithm: Spam Email (http://labs-repos.iit.demokritos.gr/skel/i-config/), Human MHC-I (http://web.cs.iastate.edu/~honavar/ailab/) and EachMovie (http://goldberg.berkeley.edu/jester-data/). Table I summarizes the statistics of the three datasets. Each of the datasets can be converted to a list of binary-labeled instances, on which binary classifiers could be built for the applications of the three real-world scenarios: Personalized Spam Email Filtering, MHC-I Binding Prediction, and Movie Recommender System.

We compared two versions of the ROMCO algorithms with two batch learning methods: multi-task feature learning (MTFL) [24] and trace-norm regularized multi-task learning (TRML) [21], as well as six online learning algorithms: online multi-task learning (OMTL) [34], online passive-aggressive algorithm (PA) [36], confidence-weighted online collaborative multi-task learning (CW-COL) [35] and three recently proposed online multi-task learning algorithms: OMTLVON, OMTLLOG [7] and OSMTL-e [38]. Due to the expensive computational cost of the batch models, we modified the setting of MTFL and TRML to handle online data by periodically retraining them after observing 100 samples. All parameters for MTFL and TRML were set to their default values. To further examine the effectiveness of the PA algorithm, we deployed two variations of this algorithm as described below: PA-Global learns a single classification model from data of all tasks; PA-Unique trains a personalized classifier for each task using its own data. The parameter C was set to 1 for all related baselines and the ROMCO algorithms. Other parameters for CW-COL, OMTLVON (OMTLLOG) and OSMTL-e were tuned with a grid search on a held-out random shuffle. The four parameters, , , and for ROMCO-NuCl/LogD, were tuned by the grid search ,…, on a held-out random shuffle.

| Algorithm | User1 Error Rate | User1 Legit F1 | User2 Error Rate | User2 Legit F1 | User3 Error Rate | User3 Legit F1 | User4 Error Rate | User4 Legit F1 |
|---|---|---|---|---|---|---|---|---|
| MTFL | 13.16(1.21) | 88.61(1.06) | 8.72(0.58) | 94.67(0.38) | 14.84(0.91) | 86.70(0.88) | 16.87(0.78) | 83.59(0.77) |
| TRML | 17.71(0.99) | 84.61(0.88) | 12.45(0.94) | 92.24(0.65) | 13.89(0.73) | 87.57(0.66) | 20.78(1.08) | 79.67(1.09) |
| PA-Global | 6.15(0.45) | 94.54(0.41) | 8.59(0.82) | 94.72(0.51) | 4.12(0.30) | 96.33(0.27) | 9.75(0.61) | 90.20(0.64) |
| PA-Unique | 5.05(0.49) | 95.51(0.44) | 8.28(0.85) | 94.91(0.52) | 3.67(0.36) | 96.73(0.32) | 8.43(0.86) | 91.52(0.89) |
| CW-COL | 5.10(0.61) | 95.45(0.55) | 6.58(0.60) | 95.91(0.38) | 4.14(0.12) | 96.29(0.10) | 7.95(0.73) | 92.08(0.73) |
| OMTL | 5.00(0.48) | 95.55(0.43) | 8.01(0.77) | 95.07(0.48) | 3.55(0.29) | 96.84(0.26) | 8.24(0.73) | 91.71(0.75) |
| OSMTL-e | 7.34(0.69) | 93.46(0.59) | 10.25(0.89) | 93.61(0.57) | 5.86(0.69) | 94.74(0.61) | 10.42(1.18) | 89.68(1.07) |
| OMTLVON | 18.88(5.70) | 85.55(3.70) | 19.76(0.10) | 89.04(0.06) | 3.54(0.30) | 96.85(0.29) | 10.54(2.47) | 89.52(2.97) |
| OMTLLOG | 4.81(0.36) | 95.73(0.32) | 7.58(0.65) | 95.36(0.39) | 2.91(0.18) | 97.41(0.16) | 7.16(0.53) | 92.87(0.51) |
| ROMCO-NuCl | 4.12(0.50) | 96.34(0.44) | 7.06(0.49) | 95.68(0.30) | 2.87(0.42) | 97.43(0.38) | 6.85(0.68) | 93.23(0.66) |
| ROMCO-LogD | 4.00(0.42) | 96.45(0.37) | 7.31(0.45) | 95.55(0.27) | 2.74(0.16) | 97.56(0.14) | 6.68(0.44) | 93.40(0.43) |

TABLE II: Cumulative error rate (%) and F1-measure (%) with their standard deviations in parentheses on the Spam Email dataset

6.1.2 Evaluation Metric

We evaluated the performance of the aforementioned algorithms by two metrics: 1) Cumulative error rate: the ratio of prediction errors over a sequence of instances. It reflects the prediction accuracy of online learners. 2) F1-measure: the harmonic mean of precision and recall. It is suitable for evaluating the learner’s performance on class-imbalanced datasets. We followed the method of [56] by randomly shuffling the ordering of samples for each dataset and repeating the experiment 10 times with new shuffles. The average results and their standard deviations are reported below.
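
The two metrics can be computed directly from the stream of true and predicted labels. The following plain-Python sketch shows one way to do so; the function names and the toy label sequences are illustrative assumptions, not data from the experiments.

```python
def cumulative_error_rate(y_true, y_pred):
    """Error rate after each round: mistakes so far divided by rounds so far."""
    mistakes = 0
    rates = []
    for t, (yt, yp) in enumerate(zip(y_true, y_pred), start=1):
        mistakes += int(yt != yp)
        rates.append(mistakes / t)
    return rates

def f1_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for yt, yp in pairs if yt == positive and yp == positive)
    fp = sum(1 for yt, yp in pairs if yt != positive and yp == positive)
    fn = sum(1 for yt, yp in pairs if yt == positive and yp != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1, -1, 1, 1]
y_pred = [1, 1, -1, 1]
rates = cumulative_error_rate(y_true, y_pred)
f1 = f1_measure(y_true, y_pred)
```

The last entry of `rates` is the cumulative error rate over the whole sequence; repeating the computation over several random shuffles and averaging gives figures of the kind reported in the tables.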

| Algorithm | Spam Email | MHC-I | EachMovie |
|---|---|---|---|
| TRML | 73.553 | 361.42 | 391.218 |
| MTFL | 78.012 | 198.90 | 302.170 |
| PA-Global | 0.423 | 1.79 | 23.337 |
| PA-Unique | 0.340 | 1.53 | 26.681 |
| CW-COL | 0.86 | 4.35 | 31.002 |
| OMTL | 26.428 | 40.314 | 85.105 |
| OSMTL-e | 0.360 | 2.327 | 11.43 |
| OMTLVON | 1.230 | 1.785 | 21.586 |
| OMTLLOG | 1.145 | 1.371 | 20.232 |
| ROMCO-NuCl | 11.49 | 4.88 | 33.235 |
| ROMCO-LogD | 10.59 | 5.352 | 28.716 |

TABLE III: Run-time (in seconds) for each algorithm

6.2 Spam Email Filtering

Fig. 3: Cumulative error rate on the Spam Email dataset along the entire online learning process

We applied online multi-task learning to build effective personalized spam filters. The task is to classify each new incoming email message into two categories: legitimate or spam. We used a dataset hosted by the Internet Content Filtering Group. The dataset contains 7068 emails collected from the mailboxes of four users (denoted by user1, user2, user3, user4). Basically, the set of all emails received by a user was not specifically generated for that user. However, the characteristics of each user’s email could be said to match the user’s interests. Each mail entry was converted to a word document vector using the TF-IDF (term frequency-inverse document frequency) representation.

Since the email dataset had no time-stamp, each email list was shuffled into a random sequence. The cumulative error rate and F1-measure results of 10 shuffles are listed in Table II. In addition, the cumulative error rate for the four specific users along the learning process is presented in Fig. 3. We also report each algorithm’s run-time, i.e., the time consumed by both training and test phase during the complete online learning process in Table III. From these results, we can make several observations.

First, the proposed ROMCO-NuCl/LogD outperform the other online learners in terms of the error rate and F1-measure for three of the four users; only on User2 does CW-COL attain a lower error rate (Table II). In particular, in accordance with the results from the four specific users, learning tasks collaboratively with both the common and personalized structures consistently beats both the global model and the personalized model.

Second, the performance of the proposed online multi-task learning methods is better than that of the two batch learning algorithms (MTFL and TRML). It should be noted that compared to online learners, which update models based only on the current samples, batch learning methods have the advantage of keeping a substantial amount of recent training samples, at the cost of storage space and higher complexity. In fact, the proposed ROMCO-NuCl/LogD are more efficient than the batch incremental methods, e.g., ROMCO-LogD is roughly 10 times faster than batch MTFL on the large-sized dataset (28.72 secs versus 302.17 secs on EachMovie, as shown in Table III). ROMCO-NuCl/LogD do not store recent training samples. They only use the current training sample and a simple rule to update the model. In contrast, batch learning algorithms need to keep a certain number of recent training samples in memory, leading to an extra burden on storage and complexity. In addition, both MTFL and TRML need to solve an optimization problem in an iterative manner. For practical applications involving hundreds of millions of users and features, the batch learning algorithms are no longer feasible, while online learners remain highly efficient and scalable.

We also observed that ROMCO-NuCl/LogD are slightly slower than CW-COL, OMTLVON/LOG and OSMTL-e. This is expected, as ROMCO-NuCl/LogD have to update two component weight matrices. However, the extra computational cost is worthwhile given the significant improvement on both measures achieved by using the two components.

6.3 MHC-I Binding Prediction

Algorithm   Error Rate   Positive Class F1-measure   Negative Class F1-measure
MTFL 43.84(6.05) 51.04(10.12) 59.12(7.35)
TRML 44.26(5.98) 50.50(9.97) 58.80(7.40)
PA-Global 44.70(2.68) 45.44(9.86) 61.28(3.17)
PA-Unique 41.62(3.95) 51.08(10.23) 63.02(2.89)
CW-COL 41.32(4.46) 50.89(10.70) 63.59(3.08)
OMTL 41.56(3.97) 51.13(10.25) 63.08(2.89)
OSMTL-e 42.78(0.65) 50.48(0.43) 61.59(0.96)
OMTLVON 38.13(5.03) 54.73(10.44) 66.39(4.02)
OMTLLOG 38.08(5.16) 54.83(10.56) 66.40(4.13)
ROMCO-NuCl 38.09(5.32) 55.03(10.53) 66.34(4.09)
ROMCO-LogD 37.91(4.97) 55.09(10.10) 66.55(4.14)
TABLE IV: Cumulative error rate (%) and F1-measure (%), with standard deviations in parentheses, over the 12 tasks of the MHC-I dataset
Fig. 4: Cumulative error rate on the 12 tasks of MHC-I dataset along the entire online learning process

Computational methods have been widely used in bioinformatics to build models that infer properties from biological data [57, 58]. In this experiment, we evaluated several methods for predicting peptide binding to human MHC (major histocompatibility complex) class I molecules. It is known that peptide binding to human MHC-I molecules plays a crucial role in the immune system. The prediction of such binding has valuable applications in vaccine design, the diagnosis and treatment of cancer, etc. Recent work has demonstrated that there exists common information between related molecules (alleles) and that such information can be leveraged to improve peptide MHC-I binding prediction.

We used a binary-labeled MHC-I dataset. The data consists of 18664 peptide sequences for 12 human MHC-I molecules. Each peptide sequence was converted to a 400-dimensional feature vector following [35]. The goal is to determine whether a peptide sequence (instance) binds to an MHC-I molecule (task) or not, i.e., binder or non-binder.

We report the average cumulative error rate and F1-measure of the 12 tasks in Table IV. To make a clear comparison between the proposed ROMCO-NuCl/LogD and the baselines, we show the variation of their cumulative error rate along the entire online learning process, averaged over the 10 runs, in Fig. 4.

From these results, we first observed that the permutations of the dataset have little influence on the performance of each method, as indicated by the small standard deviation values in Table IV. Note that the majority of the dataset belongs to the negative class; thus, predicting more examples as the majority class decreases the overall error rate but also degrades the accuracy on the minority positive class. The consistently good performance achieved by the proposed ROMCO-NuCl/LogD in terms of the error rate and the F1-measures of both classes further demonstrates the effectiveness of our algorithms on imbalanced datasets. Moreover, among the six online models, learning related tasks jointly still achieves better performance than learning the tasks individually, as shown by the improvement of the ROMCO and OMTL models over the PA-Unique model.
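The interplay between error rate and per-class F1-measure on imbalanced data can be illustrated with a minimal sketch. The helper names (`cumulative_error_rates`, `f1_for_class`) and the toy label stream are illustrative assumptions, not part of the experimental code; the sketch only shows why a low error rate alone can hide poor minority-class performance.

```python
def cumulative_error_rates(y_true, y_pred):
    """Error rate after each online round: mistakes so far / rounds so far."""
    rates, mistakes = [], 0
    for t, (y, p) in enumerate(zip(y_true, y_pred), start=1):
        mistakes += int(y != p)
        rates.append(mistakes / t)
    return rates


def f1_for_class(y_true, y_pred, cls):
    """F1-measure when `cls` is treated as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, p in pairs if p == cls and y == cls)
    fp = sum(1 for y, p in pairs if p == cls and y != cls)
    fn = sum(1 for y, p in pairs if p != cls and y == cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


# Hypothetical imbalanced stream: 8 negatives, 2 positives (labels in {-1, +1}).
y_true = [-1, -1, 1, -1, -1, -1, 1, -1, -1, -1]
# A degenerate predictor that always outputs the majority (negative) class.
always_negative = [-1] * len(y_true)

final_error = cumulative_error_rates(y_true, always_negative)[-1]
f1_positive = f1_for_class(y_true, always_negative, 1)
f1_negative = f1_for_class(y_true, always_negative, -1)
print(final_error, f1_positive, f1_negative)
```

In this toy stream the majority-class predictor reaches a 20% error rate while its positive-class F1-measure is zero, which is why both per-class F1-measures are reported alongside the error rate.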

Fig. 5: Average error rate and F1-measure on the EachMovie dataset over 30 tasks along the entire online learning process
Algorithm   Error Rate   Positive Class F1-measure   Negative Class F1-measure
MTFL 27.51(12.25) 79.18(12.87) 36.06(14.85)
TRML 26.58(11.82) 79.89(12.49) 37.64(15.05)
PA-Global 31.80(5.87) 74.43(8.61) 47.96(14.47)
PA-Unique 19.68(7.39) 82.97(9.35) 57.80(21.05)
CW-COL 25.45(6.96) 78.89(9.30) 53.95(16.71)
OMTL 19.44(7.28) 83.18(9.29) 57.77(21.39)
OSMTL-e 20.73(7.15) 82.31(9.04) 58.76(18.71)
OMTLVON 18.61(7.29) 84.45(8.64) 55.92(23.94)
OMTLLOG 18.61(7.29) 84.45(8.64) 55.92(23.94)
ROMCO-NuCl 19.14(7.20) 83.46(9.25) 58.27(21.17)
ROMCO-LogD 18.21(6.71) 84.63(8.41) 55.53(25.02)
TABLE V: Average cumulative error rate (%) and F1-measure (%), with standard deviations in parentheses, over the 30 tasks of the EachMovie dataset

6.4 Movie Recommender System

Fig. 6: Sensitivity analysis of the effect of the two regularization parameters in terms of the error rate

In recent years, recommender systems have achieved great success in many real-world applications. The goal is to predict users’ preferences on targeted products, i.e., given partially observed user-movie rating entries in the underlying (ratings) matrix, we would like to infer their preference for unrated movies.

We used a dataset hosted by the DEC System Research Center, which collected the EachMovie recommendation data for 18 months. During that time, 72916 users entered a total of 2811983 numeric ratings for 1628 different movies. We randomly selected 30 users (tasks) who viewed exactly 200 movies and used their ratings as the target classes. For each of the 30 users, we then randomly selected 1783 users who viewed the same 200 movies and used their ratings as the features of the movies. The six possible ratings were converted into binary classes (i.e., like or dislike) based on the rating order.

Table V shows the comparison results in terms of the average cumulative error rate and F1-measure. Fig. 5 depicts the cumulative error rate along the entire online learning process, averaged over the 30 tasks of the EachMovie dataset. From these results, we can draw several conclusions.

First, it can be seen that the proposed ROMCO-NuCl/LogD outperform the other baselines: they generally provide smaller error rates and higher F1-measures. This shows that our algorithms can maintain high prediction accuracy. We believe that the promising result is due to two reasons. First, the personalized and correlative patterns are effective in discovering the personalized tasks and task relatedness, and these patterns are successfully captured in three real-world datasets. Second, once an error occurs in at least one task, ROMCO-NuCl/LogD update the entire task matrix. This benefits other related tasks with few learning instances, since the shared subspaces are updated accordingly.

Next, we observed that ROMCO-LogD is better than ROMCO-NuCl in Fig. 5 in terms of the error rate and F1-measure. This is expected because, compared to the nuclear norm, the log-determinant function used by ROMCO-LogD achieves a better rank approximation, i.e., it reduces the contribution of large singular values while shrinking the small singular values toward zero.
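The difference between the two rank surrogates can be sketched numerically. The snippet below assumes the log-determinant surrogate takes the common form log det(I + XᵀX) = Σᵢ log(1 + σᵢ²); the exact form used by ROMCO-LogD is defined earlier in the paper and may differ. The matrix sizes and variable names are hypothetical.

```python
import numpy as np


def nuclear_norm(singular_values):
    """Nuclear norm: the plain sum of singular values."""
    return float(np.sum(singular_values))


def logdet_surrogate(singular_values):
    """Assumed surrogate: log det(I + X^T X) = sum_i log(1 + sigma_i^2)."""
    return float(np.sum(np.log1p(singular_values ** 2)))


rng = np.random.default_rng(0)
# Hypothetical rank-2 weight matrix (20 features x 10 tasks).
low_rank = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 10))
sigma = np.linalg.svd(low_rank, compute_uv=False)
# Same rank, but every singular value is 10 times larger.
sigma_scaled = 10.0 * sigma

nuc_ratio = nuclear_norm(sigma_scaled) / nuclear_norm(sigma)
logdet_ratio = logdet_surrogate(sigma_scaled) / logdet_surrogate(sigma)
print(f"nuclear norm grows by {nuc_ratio:.2f}x, log-det by {logdet_ratio:.2f}x")
```

Both matrices have rank 2, yet the nuclear norm grows linearly with the magnitude of the singular values, whereas the logarithm grows much more slowly. Under the stated assumption, this is the sense in which the log-determinant function reduces the contribution of large singular values.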

6.5 Effect of the Regularization Parameters

We used the Spam Email and Human MHC-I datasets as the cases for the parameter sensitivity analysis. In the Spam Email dataset, we fixed one of the two regularization parameters and varied the other over its tuning set to study how it affects the classification performance of ROMCO-NuCl/LogD, and then swapped the roles of the two parameters. We studied the pair of parameters in the same way on the Human MHC-I dataset, with its own fixed values and tuning sets. In Fig. 6, we show the classification performance of ROMCO in terms of the error rate for each pair of parameter values. From Fig. 6, we observed that on the Spam Email dataset the performance degrades as either parameter increases. This indicates weak relatedness among the tasks and the existence of many personalized tasks in the Email dataset. On Human MHC-I, poor performance is triggered by a small value of one parameter or a large value of the other. Compared with the Email data, MHC-I contains fewer personalized tasks, while most tasks are closely related and well represented by a low-dimensional subspace.
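To give some intuition for how the two regularization weights act, the following sketch applies the standard proximal operators of the nuclear norm (singular value soft-thresholding) and of the column-wise group lasso (group shrinkage) to a random matrix. These are textbook operators and not necessarily the exact ROMCO update; the convention that tasks correspond to columns, the threshold name `tau`, and the matrix size are assumptions made for illustration.

```python
import numpy as np


def singular_value_threshold(matrix, tau):
    """Proximal operator of tau * nuclear norm: soft-threshold the singular values."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u * np.maximum(s - tau, 0.0)) @ vt


def column_group_shrink(matrix, tau):
    """Proximal operator of tau * (sum of column l2 norms), i.e. group lasso."""
    norms = np.linalg.norm(matrix, axis=0)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return matrix * scale


rng = np.random.default_rng(0)
weights = rng.standard_normal((8, 6))  # hypothetical: 8 features x 6 tasks

ranks, active_columns = [], []
for tau in (0.5, 2.0, 4.0):
    low_rank_part = singular_value_threshold(weights, tau)
    sparse_part = column_group_shrink(weights, tau)
    ranks.append(int(np.linalg.matrix_rank(low_rank_part)))
    active_columns.append(int(np.sum(np.linalg.norm(sparse_part, axis=0) > 0)))
    print(f"tau={tau}: rank={ranks[-1]}, active task columns={active_columns[-1]}")
```

A larger nuclear-norm weight drives the shared component toward lower rank, and a larger group-lasso weight zeroes out more task columns in the personalized component. This is consistent with the observation that the best setting depends on how many tasks are closely related and how many are outliers.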

7 Conclusion

We proposed an online MTL method that identifies sparse personalized patterns for outlier tasks while capturing a shared low-rank subspace for correlative tasks. As an online technique, the algorithm achieves a low prediction error rate by leveraging previously learned knowledge. As a multi-task approach, it balances the shared and personalized structures by learning all the tasks jointly. In addition, we proposed a log-determinant function to approximate the rank of the matrix, which, in turn, achieves better performance than the nuclear norm. We showed that the algorithm achieves a sub-linear regret bound with respect to the best linear model in hindsight, which provides theoretical support for the proposed algorithm. Meanwhile, the empirical results demonstrate that our algorithms outperform other state-of-the-art techniques on three real-world applications. In future work, online active learning could be applied in the MTL scenario in order to reduce the labelling cost.

Appendix

Root Computation of the Log-Determinant Function

To solve the log-determinant function, we set the derivative of Eq. (20) for each to zero with ,

Assume that , , and , we define that , where

Then the three possible roots of the above cubic equation include one real root and two complex roots,