Robust Online Multi-Task Learning with Correlative and Personalized Structures
Abstract
Multi-Task Learning (MTL) can enhance a classifier's generalization performance by learning multiple related tasks simultaneously. Conventional MTL works under the offline or batch setting, and suffers from expensive training cost and poor scalability. To address such inefficiency issues, online learning techniques have been applied to solve MTL problems. However, most existing algorithms for online MTL constrain task relatedness into a presumed structure via a single weight matrix, a strict restriction that does not always hold in practice. In this paper, we propose a robust online MTL framework that overcomes this restriction by decomposing the weight matrix into two components: the first captures the low-rank common structure among tasks via a nuclear norm, and the second identifies the personalized patterns of outlier tasks via a group lasso. Theoretical analysis shows that the proposed algorithm achieves a sublinear regret with respect to the best linear model in hindsight. Although this framework achieves good performance, the nuclear norm, which simply adds all nonzero singular values together, may not be a good low-rank approximation. To improve the results, we use a log-determinant function as a non-convex rank approximation. A gradient scheme is applied to optimize the log-determinant function, yielding a closed-form solution for this refined problem. Experimental results on a number of real-world applications verify the efficacy of our method.
Index Terms: artificial intelligence, learning systems, online learning, multi-task learning, classification.
1 Introduction
Multi-Task Learning (MTL) aims to enhance the overall generalization performance by learning multiple related tasks simultaneously. It has been extensively studied from various points of view [1, 2, 3, 4]. As an example, the common tastes of users (i.e., tasks) with respect to movies (i.e., instances) can be harnessed in a movie recommender system using MTL [5]. Most MTL methods run under the offline learning setting, where the training data for each task is available beforehand. However, offline learning methods are generally inefficient, since they suffer from a high training cost and poor scalability, especially when it comes to large-scale streaming data. As a remedy, MTL has been studied under the online setting, in which the model runs over a sequence of data, processing instances one by one [6]. After the model is updated in each round, the current input is discarded. As a result, online learning algorithms are efficient and scalable, and have been successfully applied to a number of MTL applications [7, 8, 9, 10, 11].
In this paper, we investigate MTL under the online setting. Existing online MTL methods assume that all tasks are related to each other and simply constrain their relationships via a presumed structure [7, 12]. However, such a constraint may be too restrictive and rarely holds in real-life applications, as personalized tasks with individual traits often exist [13]. We address this drawback through a new formulation of online MTL that consists of two components: the first captures a low-rank correlative structure over the related tasks, while the second represents the personalized patterns of individual tasks.
Specifically, our algorithm learns a weight matrix that is decomposed into the two components mentioned above. A nuclear-norm regularization is imposed on the first component to induce a low-rank correlative structure of the related tasks. A group-lasso penalty is applied to the second component to identify the outlier tasks. We then apply an online projected gradient scheme to solve this non-smooth problem, with a closed-form solution for both the correlative and personalized components. This gives our algorithm two advantages: 1) it can make predictions and update the models efficiently in a real-time manner; 2) it can achieve a good trade-off between the common and personalized structures. We also provide a theoretical evaluation by proving that our algorithm achieves a sublinear regret compared with the best linear model in hindsight.
Although our algorithm achieves good performance, it may not approximate a low-rank matrix accurately: the nuclear norm is essentially the sum of the singular values, which is known to be biased in estimation since the large singular values dominate the approximation. To address this issue, we use a log-determinant function to approximate the matrix rank, which is able to reduce the contributions of large singular values while keeping those of small singular values close to zero. To solve this non-convex optimization problem, a proximal gradient algorithm is derived to adaptively learn such a low-rank structure with a closed-form solution. In addition, we prove that the refined objective has a unique root under a proper parameter setting. Finally, we conduct comparative experiments against a variety of state-of-the-art techniques on three real-world datasets. Empirically, the refined algorithm with the log-determinant function achieves better performance than that with the nuclear norm, thanks to a better low-rank approximation.
The rest of this paper is organized as follows. Section 2 introduces related work. The problem setting and the proposed algorithms with analysis are presented in Section 3 and Section 4, respectively. Section 5 provides theoretical analysis, and Section 6 reports experimental results.
2 Related Work
In this section, we briefly introduce work related to MTL in the offline and online settings, followed by low-rank matrix approximation.
Multi-Task Learning
Conventional offline or batch MTL algorithms can be broadly classified into two categories: explicit parameter sharing and implicit parameter sharing. In the first category, all tasks explicitly share some common parameters, such as hidden units in neural networks [14], the prior in hierarchical Bayesian models [15, 16], the feature mapping matrix [17] and the classification weight [18]. On the other hand, the shared structure can be estimated implicitly by imposing a low-rank subspace [19, 20], e.g., Trace-norm Regularized Multi-task Learning (TRML) [21] captured a common low-dimensional subspace of the task relationship with a trace-norm regularization; or a common set of features [22, 23], e.g., Multi-Task Feature Learning (MTFL) [24] learned features shared across the tasks in an unsupervised manner. Besides, [25] and [26] proposed a small set of experts to learn the task relatedness over the entire task set. These MTL techniques have been successfully used in real-world applications, e.g., multi-view action recognition [27], spam detection [28] and head pose estimation [29].
Compared with offline learning, online learning techniques are more efficient and better suited to handling massive and sequential data [30, 31, 32]. An early work [33, 34], Online Multi-Task Learning (OMTL), studied online learning of multiple tasks in parallel and exploited the task structure by using a global loss function. Another work [35] proposed a collaborative online framework, Confidence-weighted Collaborative Online Multi-task Learning (CWCOL), which learned the task relatedness by combining the individual and global variations of online Passive-Aggressive (PA) algorithms [36]. Instead of fixing the task relationship via a presumed structure [12], a recent online MTL approach introduced an adaptive interaction matrix that quantified the task relevance with the LogDet divergence (OMTLLOG) and the von Neumann divergence (OMTLVON) [7], respectively. Most recently, [37] proposed the Shared Hypothesis model (SHAMO), which used a K-means-like procedure to cluster different tasks in order to learn the shared hypotheses. Similar to SHAMO, [38] proposed Online Smoothed Multi-Task Learning with exponential updates (OSMTL-e), which jointly learns both the per-task model parameters and the inter-task relationships in an online MTL setting. The algorithm presented in this paper differs from existing ones in that it learns both a common structure among the correlative tasks and the individual structure of outlier tasks.
Low-Rank Matrix Approximation
In many areas (e.g., machine learning, signal and image processing), high-dimensional data are commonly used. Rather than being uniformly distributed, high-dimensional data often lie on low-dimensional structures, and recovering the low-dimensional subspace can preserve and reveal the latent structure of the data. For example, face images of an individual under different lighting conditions span a low-dimensional subspace of the ambient high-dimensional space [39]. To learn low-dimensional subspaces, recently proposed methods such as Low-Rank Representation (LRR) [40] and Low-Rank Subspace Clustering (LRSC) [41] usually rely on the nuclear norm as a convex rank-approximation function to seek low-rank subspaces. Unlike the rank function, which treats all nonzero singular values equally, the nuclear norm simply adds them together, so the large values contribute disproportionately to the approximation, rendering it much deviated from the true rank. To resolve this problem, we propose a log-determinant function to approximate the rank function, which is able to reduce the contributions of large singular values while keeping those of small singular values close to zero. To the best of our knowledge, this is the first work that exploits a log-determinant function to learn a low-rank structure of the task relationship in the online MTL problem.
3 Problem Setting
In this section, we first describe our notations, followed by the problem setting of online MTL.
3.1 Notations
Lowercase letters denote scalars, lowercase bold letters denote vectors, uppercase letters denote elements of a matrix, and boldface uppercase letters denote matrices. $\mathbf{W}_i$ and $W_{ij}$ denote the $i$-th column and the $(i,j)$-th element of a matrix $\mathbf{W}$, respectively. The Euclidean and Frobenius norms are denoted by $\|\cdot\|_2$ and $\|\cdot\|_F$, respectively. In particular, for every matrix $\mathbf{W}$, we define its $\ell_{2,1}$-norm as $\|\mathbf{W}\|_{2,1} = \sum_i \|\mathbf{W}_i\|_2$. When a function $f$ is differentiable, we denote its gradient by $\nabla f$.
3.2 Problem Setting
In the online MTL setting, we are faced with $m$ different but related classification problems, also known as tasks. Each task $i$ has a sequence of instance-label pairs $\{(\mathbf{x}_t^i, y_t^i)\}_{t=1}^T$, where $\mathbf{x}_t^i \in \mathbb{R}^d$ is a feature vector drawn from a single feature space shared by all tasks, and $y_t^i \in \{-1, +1\}$. The algorithm maintains $m$ separate models in parallel, one for each of the $m$ tasks. At round $t$, $m$ instances are presented at one time. Given the $i$-th task instance $\mathbf{x}_t^i$, the algorithm predicts its label using a linear model $\mathbf{w}_t^i \in \mathbb{R}^d$, i.e., $\hat{y}_t^i = \mathrm{sign}(\mathbf{w}_t^i \cdot \mathbf{x}_t^i)$, where $\mathbf{w}_t^i$ is the weight parameter at round $t$. The true label $y_t^i$ is revealed only after the prediction is made. A hinge-loss function is applied to evaluate the prediction,
$$\ell_t^i(\mathbf{w}_t^i) = \max\{0,\; 1 - y_t^i\, \mathbf{w}_t^i \cdot \mathbf{x}_t^i\}, \qquad (1)$$
where $\cdot$ denotes the inner product. The cumulative loss over all tasks at round $t$ is defined as
$$L_t(\mathbf{W}_t) = \sum_{i=1}^{m} \ell_t^i(\mathbf{w}_t^i), \qquad (2)$$
where $\mathbf{W}_t = [\mathbf{w}_t^1, \ldots, \mathbf{w}_t^m]$ is the weight matrix for all tasks. Inspired by Regularized Loss Minimization (RLM), in which one jointly minimizes an empirical loss plus a regularization term [42], we formulate our online MTL problem as minimizing the regret compared with the best linear model in hindsight,
$$R_T = \sum_{t=1}^{T} \big[L_t(\mathbf{W}_t) + r(\mathbf{W}_t)\big] - \min_{\mathbf{W} \in \Omega} \sum_{t=1}^{T} \big[L_t(\mathbf{W}) + r(\mathbf{W})\big],$$
where $\Omega$ is a closed convex set and the regularizer $r(\cdot)$ is a convex regularization function that constrains $\mathbf{W}$ into simple sets, e.g., hyperplanes, balls, bound constraints, etc. For instance, an $\ell_1$-norm regularizer constrains $\mathbf{W}$ to be a sparse matrix.
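For concreteness, the per-task hinge loss of Eq. (1) and the cumulative loss of Eq. (2) can be sketched as follows (a minimal illustration; the function names are ours, not the paper's):

```python
import numpy as np

def hinge_loss(w, x, y):
    """Hinge loss for one task (Eq. (1)): max(0, 1 - y * <w, x>)."""
    return max(0.0, 1.0 - y * float(np.dot(w, x)))

def cumulative_loss(W, X, ys):
    """Sum of per-task hinge losses at one round (Eq. (2)).
    W: (d, m) weight matrix, one column per task.
    X: (d, m) current instances, ys: length-m labels in {-1, +1}."""
    m = W.shape[1]
    return sum(hinge_loss(W[:, i], X[:, i], ys[i]) for i in range(m))
```

For example, a correctly classified instance with margin at least 1 incurs zero loss, while a misclassified one incurs a loss that grows linearly with the violation.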
4 Algorithm
We propose to solve this regret minimization in two steps: 1) learn the correlative and personalized patterns over multiple tasks; 2) achieve an optimal solution for the regret.
4.1 Correlative and Personalized Structures
We propose a novel formulation of online MTL that incorporates two components, as illustrated in Fig. 1. The first component, $\mathbf{U}$, captures a low-rank common structure over the similar tasks, where one model (or pattern) can be shared across the related tasks. As outlier tasks often exist in real-world scenarios, the second component, $\mathbf{V}$, identifies the personalized patterns specific to individual tasks. The incorporation of the two structures $\mathbf{U}$ and $\mathbf{V}$ thus makes the final model more robust and reliable.
To learn both correlative and personalized structures from multiple tasks, we decompose the weight matrix into two components, a correlative matrix $\mathbf{U}$ and a personalized matrix $\mathbf{V}$, and define a new weight matrix,
$$\mathbf{W} = \begin{bmatrix}\mathbf{U} \\ \mathbf{V}\end{bmatrix}, \quad \text{i.e.,} \quad \mathbf{w}^i = \begin{bmatrix}\mathbf{u}^i \\ \mathbf{v}^i\end{bmatrix}, \qquad (3)$$
where $\mathbf{w}^i$ is the $i$-th column of the weight matrix $\mathbf{W}$. Denoting by $\widehat{\mathbf{W}}$ the summation of $\mathbf{U}$ and $\mathbf{V}$, we obtain
$$\widehat{\mathbf{W}} = \mathbf{U} + \mathbf{V} = [\mathbf{I}_d,\, \mathbf{I}_d]\, \mathbf{W}, \qquad (4)$$
where $\mathbf{I}_d$ is a $d \times d$ identity matrix. Given an instance $\mathbf{x}_t^i$, the algorithm makes a prediction based on both the correlative and personalized parameters,
$$\hat{y}_t^i = \mathrm{sign}\big((\mathbf{u}_t^i + \mathbf{v}_t^i) \cdot \mathbf{x}_t^i\big), \qquad (5)$$
with the corresponding loss function,
$$\ell_t^i(\mathbf{u}_t^i, \mathbf{v}_t^i) = \max\{0,\; 1 - y_t^i\, (\mathbf{u}_t^i + \mathbf{v}_t^i) \cdot \mathbf{x}_t^i\}.$$
We can thus reformulate the cumulative loss function with respect to $(\mathbf{U}, \mathbf{V})$,
$$L_t(\mathbf{U}_t, \mathbf{V}_t) = \sum_{i=1}^{m} \ell_t^i(\mathbf{u}_t^i, \mathbf{v}_t^i). \qquad (6)$$
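The combined prediction rule of Eq. (5) can be sketched as follows (a minimal illustration; the function name is ours):

```python
import numpy as np

def predict(u, v, x):
    """Prediction from the correlative part u and personalized part v
    (Eq. (5)): y_hat = sign((u + v) . x)."""
    score = float(np.dot(u + v, x))
    return 1 if score >= 0 else -1
```

The personalized column v can override the shared model u for an outlier task, since only their sum enters the decision.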
We impose a regularizer on $\mathbf{U}$ and $\mathbf{V}$, respectively,
$$r(\mathbf{U}, \mathbf{V}) = \lambda_1 r_1(\mathbf{U}) + \lambda_2 r_2(\mathbf{V}), \qquad (7)$$
where $\lambda_1$ and $\lambda_2$ are non-negative trade-off parameters. Substituting Eqs. (6) and (7) into the regret, it can be reformulated as
$$R_T = \sum_{t=1}^{T} \big[L_t(\mathbf{U}_t, \mathbf{V}_t) + r(\mathbf{U}_t, \mathbf{V}_t)\big] - \min_{(\mathbf{U}, \mathbf{V}) \in \Omega} \sum_{t=1}^{T} \big[L_t(\mathbf{U}, \mathbf{V}) + r(\mathbf{U}, \mathbf{V})\big], \qquad (8)$$
where $r(\cdot, \cdot)$ is a non-smooth convex function. We next show how to achieve an optimal solution to the reformulated regret (8).
4.2 Online Task Relationship Learning
Inspired by [43], we can solve the regret (8) by a subgradient projection,
$$\mathbf{W}_{t+1} = \arg\min_{\mathbf{W} \in \Omega}\; \big\|\mathbf{W} - \big(\mathbf{W}_t - \eta\, \partial [L_t(\mathbf{W}_t) + r(\mathbf{W}_t)]\big)\big\|_F^2, \qquad (9)$$
where $\eta$ is the learning rate. In the following lemma, we show that problem (9) can be turned into a linearized version of the proximal algorithm [44]. To do so, we first introduce a Bregman-like distance function [45],
$$B_\Phi(\mathbf{W}, \mathbf{W}_t) = \Phi(\mathbf{W}) - \Phi(\mathbf{W}_t) - \langle \nabla \Phi(\mathbf{W}_t),\; \mathbf{W} - \mathbf{W}_t \rangle,$$
where $\Phi$ is a differentiable and convex function.
Lemma 1
Assume $\Phi(\mathbf{W}) = \frac{1}{2}\|\mathbf{W}\|_F^2$; then, using a first-order Taylor expansion of $L_t$, the algorithm (9) is equivalent to a linearized form with a step-size parameter $\eta$,
$$\mathbf{W}_{t+1} = \arg\min_{\mathbf{W} \in \Omega}\; \langle \partial [L_t(\mathbf{W}_t) + r(\mathbf{W}_t)],\; \mathbf{W} \rangle + \frac{1}{\eta} B_\Phi(\mathbf{W}, \mathbf{W}_t).$$
Instead of balancing this trade-off individually for each of the multiple tasks, we balance it for all the tasks jointly. However, the subgradient of the composite function $L_t + r$ cannot lead to the desired effect, since we should constrain the projected gradient into a restricted set. To address this issue, we refine the optimization function by adding the regularizer $r(\mathbf{W})$ explicitly,
$$\mathbf{W}_{t+1} = \arg\min_{\mathbf{W} \in \Omega}\; \langle \nabla L_t(\mathbf{W}_t),\; \mathbf{W} \rangle + r(\mathbf{W}) + \frac{1}{\eta} B_\Phi(\mathbf{W}, \mathbf{W}_t). \qquad (10)$$
Note that formulation (10) is different from the Mirror Descent (MD) algorithm [46], since we do not linearize the regularizer $r(\mathbf{W})$.
Given that $\mathbf{W} = [\mathbf{U}; \mathbf{V}]$, we show in the lemma below that problem (10) can be presented in terms of $\mathbf{U}$ and $\mathbf{V}$.
Lemma 2
Assume that $\Phi(\mathbf{W}) = \frac{1}{2}\|\mathbf{W}\|_F^2$ and $\mathbf{W} = [\mathbf{U}; \mathbf{V}]$; then problem (10) turns into an equivalent form in terms of $\mathbf{U}$ and $\mathbf{V}$,
$$(\mathbf{U}_{t+1}, \mathbf{V}_{t+1}) = \arg\min_{\mathbf{U}, \mathbf{V}}\; \langle \nabla_{\mathbf{U}} L_t, \mathbf{U} \rangle + \langle \nabla_{\mathbf{V}} L_t, \mathbf{V} \rangle + r(\mathbf{U}, \mathbf{V}) + \frac{1}{2\eta_u}\|\mathbf{U} - \mathbf{U}_t\|_F^2 + \frac{1}{2\eta_v}\|\mathbf{V} - \mathbf{V}_t\|_F^2, \qquad (11)$$
where the parameters $\eta_u$ and $\eta_v$ control how much previously learned knowledge is retained by $\mathbf{U}$ and $\mathbf{V}$.
4.3 Regularization
As mentioned above, restricting task relatedness to a presumed structure via a single weight matrix [7] is too strict and not always plausible in practical applications. To overcome this problem, we impose the following regularizer on $\mathbf{U}$ and $\mathbf{V}$,
$$r(\mathbf{U}, \mathbf{V}) = \lambda_1 \|\mathbf{U}\|_* + \lambda_2 \|\mathbf{V}\|_{2,1}. \qquad (12)$$
A nuclear norm [19], $\|\mathbf{U}\|_* = \sum_i \sigma_i(\mathbf{U})$, is imposed on $\mathbf{U}$ so that the multiple tasks can be represented by a small number of basis patterns. Intuitively, a model performing well on one task is likely to perform well on similar tasks, so we expect the best model to be shared across several related tasks. However, the assumption that all tasks are correlated may not hold in real applications. We therefore impose the $\ell_{2,1}$-norm [47], $\|\mathbf{V}\|_{2,1} = \sum_i \|\mathbf{v}^i\|_2$, on $\mathbf{V}$, which favors a matrix with few nonzero columns and thereby captures the personalized tasks.
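The regularizer of Eq. (12) can be evaluated directly from the singular values of U and the column norms of V; a minimal numpy sketch (function name ours):

```python
import numpy as np

def regularizer(U, V, lam1, lam2):
    """r(U, V) = lam1 * ||U||_* + lam2 * ||V||_{2,1}:
    nuclear norm (sum of singular values) on the correlative part U,
    column-wise group lasso on the personalized part V (Eq. (12))."""
    nuclear = np.linalg.svd(U, compute_uv=False).sum()
    group = np.linalg.norm(V, axis=0).sum()   # sum of column l2-norms
    return lam1 * nuclear + lam2 * group
```

The group-lasso term grows with each nonzero column of V, so minimizing it drives the columns of non-outlier tasks to zero.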
Note that, unlike the algorithms in [25, 26, 48], our algorithm with the regularization terms above is able to detect personalized patterns. Although prior work [13] considered detecting personalized tasks, it was designed for the offline setting; our algorithm differs in that it learns the personalized pattern adaptively with online techniques.
4.3.1 Optimization
Although the composite problem (11) could be solved by [49], the composite function with linear constraints has not been investigated for the MTL problem. We employ a projected gradient scheme [50, 51] to optimize this problem with both smooth and non-smooth terms. Specifically, by substituting (12) into (11) and omitting the terms unrelated to $\mathbf{U}$ and $\mathbf{V}$, the problem can be rewritten as a projected gradient scheme,
$$\min_{\mathbf{U}, \mathbf{V}}\; \frac{1}{2}\|\mathbf{U} - \widehat{\mathbf{U}}_t\|_F^2 + \eta_u \lambda_1 \|\mathbf{U}\|_* + \frac{1}{2}\|\mathbf{V} - \widehat{\mathbf{V}}_t\|_F^2 + \eta_v \lambda_2 \|\mathbf{V}\|_{2,1},$$
where
$$\widehat{\mathbf{U}}_t = \mathbf{U}_t - \eta_u \nabla_{\mathbf{U}} L_t(\mathbf{U}_t, \mathbf{V}_t), \qquad \widehat{\mathbf{V}}_t = \mathbf{V}_t - \eta_v \nabla_{\mathbf{V}} L_t(\mathbf{U}_t, \mathbf{V}_t).$$
Due to the decomposability of the objective function above, the solutions for $\mathbf{U}$ and $\mathbf{V}$ can be optimized separately,
$$\mathbf{U}_{t+1} = \arg\min_{\mathbf{U}}\; \frac{1}{2}\|\mathbf{U} - \widehat{\mathbf{U}}_t\|_F^2 + \eta_u \lambda_1 \|\mathbf{U}\|_*, \qquad (13)$$
$$\mathbf{V}_{t+1} = \arg\min_{\mathbf{V}}\; \frac{1}{2}\|\mathbf{V} - \widehat{\mathbf{V}}_t\|_F^2 + \eta_v \lambda_2 \|\mathbf{V}\|_{2,1}. \qquad (14)$$
This has two advantages: 1) there is a closed-form solution for each update; 2) the updates for $\mathbf{U}$ and $\mathbf{V}$ can be performed in parallel.
Computation of U: Inspired by [50], we show in the following theorem that the optimal solution to (13) can be obtained by solving a simple convex optimization problem over the singular values.
Theorem 1
Let $\widehat{\mathbf{U}}_t = \mathbf{P} \Lambda \mathbf{Q}^\top$ be the SVD of $\widehat{\mathbf{U}}_t$, where $\Lambda = \mathrm{diag}(\mu_1, \ldots, \mu_r)$ and $r = \mathrm{rank}(\widehat{\mathbf{U}}_t)$. Let $\sigma^* = (\sigma_1^*, \ldots, \sigma_r^*)$ be the solution of
$$\min_{\sigma_i \ge 0}\; \sum_{i=1}^{r} \Big[\frac{1}{2}(\sigma_i - \mu_i)^2 + \eta_u \lambda_1 \sigma_i\Big], \qquad (15)$$
whose closed form is $\sigma_i^* = \max(\mu_i - \eta_u \lambda_1,\, 0)$. Then the optimal solution to (13) is
$$\mathbf{U}_{t+1} = \mathbf{P}\, \mathrm{diag}(\sigma^*)\, \mathbf{Q}^\top. \qquad (16)$$
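This update is the singular-value soft-thresholding operator, the standard closed-form solution for nuclear-norm proximal steps of this kind; a minimal numpy sketch under that assumption (function name ours):

```python
import numpy as np

def svt(A, tau):
    """Singular-value soft-thresholding: the proximal operator of tau * ||.||_*.
    Shrinks each singular value of A by tau, clipping at zero, which both
    denoises and lowers the rank of the result."""
    P, s, Qt = np.linalg.svd(A, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return P @ np.diag(s_shrunk) @ Qt
```

Singular values below the threshold tau are zeroed out, so the returned matrix typically has lower rank than the input.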
Computation of V: We rewrite (14) as an optimization problem for each column,
$$\mathbf{v}_{t+1}^i = \arg\min_{\mathbf{v}}\; \frac{1}{2}\|\mathbf{v} - \widehat{\mathbf{v}}_t^i\|_2^2 + \eta_v \lambda_2 \|\mathbf{v}\|_2, \qquad (17)$$
where $\widehat{\mathbf{v}}_t^i$ denotes the $i$-th column of $\widehat{\mathbf{V}}_t$. The proximal operator problem (17) admits a closed-form solution with time complexity $O(d)$ [52],
$$\mathbf{v}_{t+1}^i = \max\Big(0,\; 1 - \frac{\eta_v \lambda_2}{\|\widehat{\mathbf{v}}_t^i\|_2}\Big)\, \widehat{\mathbf{v}}_t^i. \qquad (18)$$
We observe that $\widehat{\mathbf{v}}_t^i$ is retained (after shrinkage) if $\|\widehat{\mathbf{v}}_t^i\|_2 > \eta_v \lambda_2$; otherwise, it decays to $\mathbf{0}$. Hence, only the personalized patterns among the tasks, i.e., those that differ from the low-rank common structure and thus cannot be captured by $\mathbf{U}$, are retained in $\mathbf{V}$.
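The column-wise shrinkage of Eq. (18) can be sketched as follows (a minimal illustration; the function name is ours):

```python
import numpy as np

def group_shrink(C, tau):
    """Column-wise closed-form update for V (Eq. (18)):
    a column c is kept and scaled by (1 - tau/||c||) if ||c|| > tau,
    otherwise it is set to zero, so only strong personalized columns survive."""
    V = np.zeros_like(C)
    for j in range(C.shape[1]):
        norm = np.linalg.norm(C[:, j])
        if norm > tau:
            V[:, j] = (1.0 - tau / norm) * C[:, j]
    return V
```

Each column is processed independently in O(d) time, matching the complexity stated above.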
The two matrices $\mathbf{U}$ and $\mathbf{V}$ can thus be updated via closed-form solutions in each round $t$. A mistake-driven strategy is used to update the model. Finally, this algorithm, which we call Robust Online Multi-task learning with Correlative and persOnalized structures and a NuClear norm term (ROMCO-NuCl), is presented in Alg. 1.
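For intuition, one round of a ROMCO-style mistake-driven update might look as follows. This is an illustrative skeleton under our reading of the scheme, not the paper's exact pseudocode: the shared step size and parameter names are simplifying assumptions.

```python
import numpy as np

def romco_round(U, V, x, y, task, eta, lam1, lam2):
    """One mistake-driven round (illustrative sketch, not Alg. 1 verbatim).
    On a hinge-loss violation, take a gradient step on both components,
    then apply the two proximal operators: nuclear norm on U (Eq. (13))
    and column-wise group lasso on V (Eq. (14))."""
    w = U[:, task] + V[:, task]
    if y * np.dot(w, x) < 1:                  # margin mistake
        G = np.zeros_like(U)
        G[:, task] = -y * x                   # subgradient of the hinge loss
        U = svt_prox(U - eta * G, eta * lam1)
        V = group_prox(V - eta * G, eta * lam2)
    return U, V

def svt_prox(A, tau):
    """Singular-value soft-thresholding (prox of the nuclear norm)."""
    P, s, Qt = np.linalg.svd(A, full_matrices=False)
    return P @ np.diag(np.maximum(s - tau, 0.0)) @ Qt

def group_prox(C, tau):
    """Column-wise group-lasso shrinkage (prox of the l_{2,1}-norm)."""
    norms = np.linalg.norm(C, axis=0)
    scale = np.where(norms > tau, 1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return C * scale
```

Rounds without a margin violation leave both matrices untouched, which is what makes the strategy mistake-driven.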
4.4 LogDeterminant Function
While the nuclear norm has been theoretically proven to be the tightest convex approximation of the rank function, it is usually difficult to prove theoretically that the nuclear norm near-optimally approximates the rank function, e.g., without an incoherence property [53][54]. In addition, the nuclear norm may not approximate the rank function accurately in practice. The rank function regards all nonzero singular values as having equal contributions, i.e., it treats every positive singular value as "1", as shown by the red line in Fig. 2. The nuclear norm, shown by the yellow star line in Fig. 2, instead treats the nonzero singular values differently: it simply adds them all together, so the larger singular values contribute more to the approximation.
To solve this issue, we introduce a log-determinant function as follows.
Definition 1
Let $\sigma_i$ ($i = 1, \ldots, r$) be the singular values of the matrix $\mathbf{U}$; the log-determinant function is defined as
$$\mathrm{LD}(\mathbf{U}) = \sum_{i=1}^{r} \log(1 + \sigma_i^2).$$
- When $\sigma_i = 0$, the term $\log(1 + \sigma_i^2) = 0$, which is the same as the true rank function;
- When $0 < \sigma_i < 1$, $\log(1 + \sigma_i^2) < 1$, implying that those small singular values can be reduced further;
- For those large singular values $\sigma_i \gg 1$, $\log(1 + \sigma_i^2) \ll \sigma_i$, which is a significant reduction over large singular values.
In this case, $\mathrm{LD}(\mathbf{U})$ approximates the rank function better than the nuclear norm by significantly reducing the weights of large singular values, meanwhile regarding those very small singular values as noise, as presented by the blue circle line in Fig. 2.
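To see the damping effect numerically, the following sketch compares the nuclear norm with a log-determinant surrogate of the assumed form sum_i log(1 + sigma_i^2); the exact surrogate in the paper's Definition 1 may differ, so treat this as an illustration only:

```python
import numpy as np

# Assumed surrogate: LD = sum_i log(1 + sigma_i^2), where sigma_i are
# singular values. This is one common log-determinant rank surrogate,
# used here purely for illustration.
def logdet_surrogate(sigmas):
    return float(np.sum(np.log(1.0 + sigmas ** 2)))

sigmas = np.array([0.0, 0.5, 10.0])
nuclear = float(sigmas.sum())          # nuclear norm: the large value dominates
logdet = logdet_surrogate(sigmas)      # large values are damped to log scale
```

On this example the nuclear norm is driven almost entirely by the single large singular value, while the log surrogate compresses it, which is exactly the bias reduction argued above.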
4.4.1 Optimal Solution
Replacing the nuclear norm with the log-determinant function $\mathrm{LD}(\mathbf{U})$, the minimization over $\mathbf{U}$ reduces to the following problem:
$$\mathbf{U}_{t+1} = \arg\min_{\mathbf{U}}\; \frac{1}{2}\|\mathbf{U} - \widehat{\mathbf{U}}_t\|_F^2 + \eta_u \lambda_1\, \mathrm{LD}(\mathbf{U}). \qquad (19)$$
To solve the objective function above, we show that the optimal solution could be obtained by solving the roots of a cubic equation in the following theorem.
Theorem 2
Let $\widehat{\mathbf{U}}_t = \mathbf{P} \Lambda \mathbf{Q}^\top$, where $\Lambda = \mathrm{diag}(\mu_1, \ldots, \mu_r)$ and $r = \mathrm{rank}(\widehat{\mathbf{U}}_t)$. Let $\sigma^* = (\sigma_1^*, \ldots, \sigma_r^*)$ be the solution of the following problem,
$$\min_{\sigma_i \ge 0}\; \sum_{i=1}^{r} \Big[\frac{1}{2}(\sigma_i - \mu_i)^2 + \eta_u \lambda_1 \log(1 + \sigma_i^2)\Big]. \qquad (20)$$
Then the optimal solution to Eq. (19), similar to Thm. 1, is given by $\mathbf{U}_{t+1} = \mathbf{P} \Sigma^* \mathbf{Q}^\top$, where $\Sigma^* = \mathrm{diag}(\sigma^*)$. To obtain the solution, the problem reduces to setting the derivative of Eq. (20) to zero for each $\sigma_i$ with $\mu_i > 0$,
$$\sigma_i^3 - \mu_i \sigma_i^2 + (1 + 2\eta_u \lambda_1)\sigma_i - \mu_i = 0. \qquad (21)$$
In general, the equation (21) has three roots; the details of the root computation are given in the Appendix. In addition, in the following proposition, we prove that (21) has a unique positive root under a proper parameter setting.
Proposition 1
Assume that $\mu_i > 0$. When $\eta_u \lambda_1 < 4$, the objective $F_i(\sigma) = \frac{1}{2}(\sigma - \mu_i)^2 + \eta_u \lambda_1 \log(1 + \sigma^2)$ is strictly convex on $[0, \infty)$, and the cubic Eq. (21) has a unique positive root located in $(0, \mu_i)$.
We need to minimize Eq. (20) under the constraint $\sigma_i \ge 0$. The derivative of $F_i$ is
$$F_i'(\sigma) = (\sigma - \mu_i) + \frac{2\eta_u \lambda_1 \sigma}{1 + \sigma^2},$$
and the second derivative is
$$F_i''(\sigma) = 1 + \frac{2\eta_u \lambda_1 (1 - \sigma^2)}{(1 + \sigma^2)^2}.$$
Case 1: If $\mu_i = 0$, then because $\sigma \ge 0$ we have $F_i'(\sigma) \ge 0$. That is, $F_i$ is non-decreasing for any $\sigma \ge 0$ and strictly increasing for $\sigma > 0$. Thus, the minimizer of $F_i$ is $\sigma_i^* = 0$.
Case 2: If $\mu_i > 0$, then the roots can exist only in the region $(0, \mu_i)$, since $F_i'(0) = -\mu_i < 0$ while $F_i'(\sigma) > 0$ for all $\sigma \ge \mu_i$.

If $\eta_u \lambda_1 < 4$, then $F_i''(\sigma) > 0$, since $F_i''(\sigma) \ge 1 - \eta_u \lambda_1/4$. In this case, $F_i$ is a strictly convex function with a unique root in $(0, \mu_i)$. Thus, the proposition is proven.

Otherwise, we determine the minimizer in the following way: denote the set of positive roots of Eq. (21) by $S$. By the first-order necessary optimality condition, the minimizer needs to be chosen from $S$, that is, $\sigma_i^* = \arg\min_{\sigma \in S} F_i(\sigma)$.
In our experiments, we initialize the parameter with a small value and increase it in each iteration. Therefore, when $\mu_i > 0$, the minimizer is the unique positive root of (21); when $\mu_i = 0$, $\sigma_i^* = 0$.
We are now ready to present the algorithm: ROMCO with a log-determinant function for rank approximation, namely ROMCO-LogD, which also exploits a mistake-driven update rule. We summarize ROMCO-LogD in Alg. 2. To the best of our knowledge, this is the first work that exploits a log-determinant function to learn a low-rank structure of the task relationship in the online MTL problem. In the next section, we theoretically analyze the performance of the proposed online MTL algorithms ROMCO-NuCl/LogD.
5 Theoretical Analysis
We next evaluate the performance of our online algorithms ROMCO-NuCl/LogD in terms of the regret bound. We first give the regret bound of algorithm (10) and its equivalent form in the following lemma, which is essentially the same as Theorem 2 in [55]:
Lemma 3
Let $\mathbf{W}_t$ be updated according to (10). Assume that $\Phi$ is strongly convex w.r.t. a norm $\|\cdot\|$ whose convex conjugate norm is $\|\cdot\|_*$. Then, for any $\mathbf{W} \in \Omega$,
$$\sum_{t=1}^{T} \big[L_t(\mathbf{W}_t) + r(\mathbf{W}_t) - L_t(\mathbf{W}) - r(\mathbf{W})\big] \le \frac{B_\Phi(\mathbf{W}, \mathbf{W}_1)}{\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \|\nabla L_t(\mathbf{W}_t)\|_*^2.$$
Remark 1
We show that the above regret is with respect to the best linear model in hindsight. Suppose that the functions $L_t$ are Lipschitz continuous; then there exists a constant $G$ such that $\|\nabla L_t(\mathbf{W}_t)\|_* \le G$ for all $t$. We then obtain:
$$\sum_{t=1}^{T} \big[L_t(\mathbf{W}_t) + r(\mathbf{W}_t) - L_t(\mathbf{W}) - r(\mathbf{W})\big] \le \frac{B_\Phi(\mathbf{W}, \mathbf{W}_1)}{\eta} + \frac{\eta T G^2}{2}.$$
We also assume that $B_\Phi(\mathbf{W}, \mathbf{W}_1) \le D^2$ for some constant $D$. Then, by setting $\eta \propto \frac{D}{G\sqrt{T}}$, the right-hand side is of order $DG\sqrt{T}$. Given that $D$ and $G$ are constants, this yields a regret of $O(\sqrt{T})$.
Lemma 4
The general optimization problem (10) is equivalent to the following two-step process:
$$\widetilde{\mathbf{W}}_{t+1} = \arg\min_{\mathbf{W}}\; \langle \nabla L_t(\mathbf{W}_t), \mathbf{W} \rangle + \frac{1}{2\eta}\|\mathbf{W} - \mathbf{W}_t\|_F^2,$$
$$\mathbf{W}_{t+1} = \arg\min_{\mathbf{W} \in \Omega}\; r(\mathbf{W}) + \frac{1}{2\eta}\|\mathbf{W} - \widetilde{\mathbf{W}}_{t+1}\|_F^2.$$
The optimal solution to the first step satisfies $\nabla L_t(\mathbf{W}_t) + \frac{1}{\eta}(\widetilde{\mathbf{W}}_{t+1} - \mathbf{W}_t) = 0$, so that
$$\widetilde{\mathbf{W}}_{t+1} = \mathbf{W}_t - \eta \nabla L_t(\mathbf{W}_t). \qquad (22)$$
Next, look at the optimal solution of the second step. For some subgradient $\mathbf{g} \in \partial r(\mathbf{W}_{t+1})$, we have
$$\eta\, \mathbf{g} + \mathbf{W}_{t+1} - \widetilde{\mathbf{W}}_{t+1} = 0. \qquad (23)$$
Substituting Eq. (22) into Eq. (23), we obtain
$$\mathbf{W}_{t+1} = \mathbf{W}_t - \eta\big(\nabla L_t(\mathbf{W}_t) + \mathbf{g}\big),$$
which satisfies the optimality condition of the one-step update (10).
We next show that ROMCO can achieve a sublinear regret in the following theorem.
Theorem 3
6 Experimental Results
We evaluate the performance of our algorithms on three real-world datasets. We start by introducing the experimental data and benchmark setup, followed by discussions of the results on three practical applications.
6.1 Data and Benchmark Setup
Table I: Statistics of the three datasets.

            | Spam Email | MHCI  | EachMovie
#Tasks      | 4          | 12    | 30
#Sample     | 7068       | 18664 | 6000
#Dimension  | 1458       | 400   | 1783
#MaxSample  | 4129       | 3793  | 200
#MinSample  | 710        | 415   | 200
6.1.1 Experimental Datasets and Baseline
We used three real-world datasets to evaluate our algorithm: Spam Email (http://labsrepos.iit.demokritos.gr/skel/iconfig/), Human MHCI (http://web.cs.iastate.edu/ honavar/ailab/) and EachMovie (http://goldberg.berkeley.edu/jesterdata/). Table I summarizes the statistics of the three datasets. Each dataset can be converted to a list of binary-labeled instances, on which binary classifiers can be built for three real-world applications: personalized spam email filtering, MHCI binding prediction, and a movie recommender system.
We compared the two versions of the ROMCO algorithm with two batch learning methods: Multi-Task Feature Learning (MTFL) [24] and Trace-norm Regularized Multi-task Learning (TRML) [21], as well as six online learning algorithms: Online Multi-Task Learning (OMTL) [34], the online Passive-Aggressive algorithm (PA) [36], Confidence-weighted Collaborative Online Multi-task Learning (CWCOL) [35] and three recently proposed online multi-task learning algorithms: OMTLVON, OMTLLOG [7] and OSMTL-e [38]. Due to the expensive computational cost of the batch models, we modified the setting of MTFL and TRML to handle online data by periodically retraining them after every 100 samples. All parameters for MTFL and TRML were set to their default values. To further examine the effectiveness of the PA algorithm, we deployed two variations of it: PA-Global learns a single classification model from the data of all tasks; PA-Unique trains a personalized classifier for each task using its own data. The parameter C was set to 1 for all related baselines and the ROMCO algorithms. Other parameters for CWCOL, OMTLVON (OMTLLOG) and OSMTL-e were tuned with a grid search on a held-out random shuffle. The four parameters of ROMCO-NuCl/LogD were likewise tuned by a grid search on a held-out random shuffle.
Table II: Cumulative error rate (%) and F1-measure (%) on the Spam Email dataset, mean (std) over 10 random shuffles.

Algorithm | User1 Error Rate | User1 F1 | User2 Error Rate | User2 F1 | User3 Error Rate | User3 F1 | User4 Error Rate | User4 F1
MTFL | 13.16(1.21) | 88.61(1.06) | 8.72(0.58) | 94.67(0.38) | 14.84(0.91) | 86.70(0.88) | 16.87(0.78) | 83.59(0.77)
TRML | 17.71(0.99) | 84.61(0.88) | 12.45(0.94) | 92.24(0.65) | 13.89(0.73) | 87.57(0.66) | 20.78(1.08) | 79.67(1.09)
PA-Global | 6.15(0.45) | 94.54(0.41) | 8.59(0.82) | 94.72(0.51) | 4.12(0.30) | 96.33(0.27) | 9.75(0.61) | 90.20(0.64)
PA-Unique | 5.05(0.49) | 95.51(0.44) | 8.28(0.85) | 94.91(0.52) | 3.67(0.36) | 96.73(0.32) | 8.43(0.86) | 91.52(0.89)
CWCOL | 5.10(0.61) | 95.45(0.55) | 6.58(0.60) | 95.91(0.38) | 4.14(0.12) | 96.29(0.10) | 7.95(0.73) | 92.08(0.73)
OMTL | 5.00(0.48) | 95.55(0.43) | 8.01(0.77) | 95.07(0.48) | 3.55(0.29) | 96.84(0.26) | 8.24(0.73) | 91.71(0.75)
OSMTL-e | 7.34(0.69) | 93.46(0.59) | 10.25(0.89) | 93.61(0.57) | 5.86(0.69) | 94.74(0.61) | 10.42(1.18) | 89.68(1.07)
OMTLVON | 18.88(5.70) | 85.55(3.70) | 19.76(0.10) | 89.04(0.06) | 3.54(0.30) | 96.85(0.29) | 10.54(2.47) | 89.52(2.97)
OMTLLOG | 4.81(0.36) | 95.73(0.32) | 7.58(0.65) | 95.36(0.39) | 2.91(0.18) | 97.41(0.16) | 7.16(0.53) | 92.87(0.51)
ROMCO-NuCl | 4.12(0.50) | 96.34(0.44) | 7.06(0.49) | 95.68(0.30) | 2.87(0.42) | 97.43(0.38) | 6.85(0.68) | 93.23(0.66)
ROMCO-LogD | 4.00(0.42) | 96.45(0.37) | 7.31(0.45) | 95.55(0.27) | 2.74(0.16) | 97.56(0.14) | 6.68(0.44) | 93.40(0.43)
6.1.2 Evaluation Metric
We evaluated the performance of the aforementioned algorithms by two metrics: 1) cumulative error rate: the ratio of prediction errors over a sequence of instances, which reflects the prediction accuracy of online learners; 2) F1-measure: the harmonic mean of precision and recall, which is suitable for evaluating performance on class-imbalanced datasets. Following [56], we randomly shuffled the ordering of samples for each dataset and repeated the experiment 10 times with new shuffles. The average results and their standard deviations are reported below.
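The two metrics can be sketched as follows (a minimal illustration; the function names are ours):

```python
def cumulative_error_rate(y_true, y_pred):
    """Fraction of prediction mistakes over the sequence seen so far."""
    mistakes = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return mistakes / len(y_true)

def f1_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the chosen positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

On an imbalanced dataset, always predicting the majority class can achieve a low error rate while the minority-class F1 collapses, which is why both metrics are reported.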
Table III: Runtime (in seconds) of the complete online learning process.

Algorithm | Spam Email | MHCI | EachMovie
TRML | 73.553 | 361.42 | 391.218
MTFL | 78.012 | 198.90 | 302.170
PA-Global | 0.423 | 1.79 | 23.337
PA-Unique | 0.340 | 1.53 | 26.681
CWCOL | 0.86 | 4.35 | 31.002
OMTL | 26.428 | 40.314 | 85.105
OSMTL-e | 0.360 | 2.327 | 11.43
OMTLVON | 1.230 | 1.785 | 21.586
OMTLLOG | 1.145 | 1.371 | 20.232
ROMCO-NuCl | 11.49 | 4.88 | 33.235
ROMCO-LogD | 10.59 | 5.352 | 28.716
6.2 Spam Email Filtering
We applied online multi-task learning to build effective personalized spam filters. The task is to classify each new incoming email message into two categories: legitimate or spam. We used a dataset hosted by the Internet Content Filtering Group, containing 7068 emails collected from the mailboxes of four users (denoted by user1, user2, user3 and user4). The set of all emails received by a user was not specifically generated for that user, but the characteristics of each user's emails can be assumed to match the user's interests. Each email entry was converted to a word document vector using the TF-IDF (term frequency-inverse document frequency) representation.
Since the email dataset has no timestamps, each email list was shuffled into a random sequence. The cumulative error rate and F1-measure results over 10 shuffles are listed in Table II. In addition, the cumulative error rate for the four users along the learning process is presented in Fig. 3. We also report each algorithm's runtime, i.e., the time consumed by both the training and test phases during the complete online learning process, in Table III. From these results, we can make several observations.
First, the proposed ROMCO-NuCl/LogD outperform the other online learners in terms of the error rate and F1-measure. In particular, according to the results for the four users, learning tasks collaboratively with both the common and personalized structures consistently beats both the global model and the personalized models.
Second, the proposed online multi-task learning methods perform better than the two batch learning algorithms (MTFL and TRML). It should be noted that, compared with online learners, which update models based only on the current samples, batch learning methods have the advantage of keeping a substantial number of recent training samples, at the cost of storage space and higher complexity. In fact, the proposed ROMCO-NuCl/LogD are more efficient than the batch incremental methods, e.g., roughly 10 times faster than batch MTFL on the largest dataset (28.72 secs versus 302.17 secs on EachMovie, as shown in Table III). ROMCO-NuCl/LogD do not store recent training samples; they only use the current training sample and a simple rule to update the model. In contrast, batch learning algorithms need to keep a certain number of recent training samples in memory, leading to extra burdens on storage and complexity. In addition, both MTFL and TRML need to solve an optimization problem in an iterative manner. For practical applications involving hundreds of millions of users and features, the batch learning algorithms are no longer feasible, while online learners remain highly efficient and scalable.
We also observe that ROMCO-NuCl/LogD are slightly slower than CWCOL, OMTLVON/LOG and OSMTL-e. This is expected, as ROMCO-NuCl/LogD have to update two component weight matrices. However, the extra computational cost is worthwhile, considering the significant improvement on the two measurements achieved by using the two components.
6.3 MHCI Binding Prediction
Table IV: Average cumulative error rate (%) and F1-measure (%) over the 12 tasks of the MHCI dataset.

Algorithm | Error Rate | Positive Class F1-measure | Negative Class F1-measure
MTFL | 43.84(6.05) | 51.04(10.12) | 59.12(7.35)
TRML | 44.26(5.98) | 50.50(9.97) | 58.80(7.40)
PA-Global | 44.70(2.68) | 45.44(9.86) | 61.28(3.17)
PA-Unique | 41.62(3.95) | 51.08(10.23) | 63.02(2.89)
CWCOL | 41.32(4.46) | 50.89(10.70) | 63.59(3.08)
OMTL | 41.56(3.97) | 51.13(10.25) | 63.08(2.89)
OSMTL-e | 42.78(0.65) | 50.48(0.43) | 61.59(0.96)
OMTLVON | 38.13(5.03) | 54.73(10.44) | 66.39(4.02)
OMTLLOG | 38.08(5.16) | 54.83(10.56) | 66.40(4.13)
ROMCO-NuCl | 38.09(5.32) | 55.03(10.53) | 66.34(4.09)
ROMCO-LogD | 37.91(4.97) | 55.09(10.10) | 66.55(4.14)
Computational methods have been widely used in bioinformatics to build models that infer properties from biological data [57, 58]. In this experiment, we evaluated several methods for predicting peptide binding to human MHC (major histocompatibility complex) class I molecules. Peptide binding to human MHCI molecules plays a crucial role in the immune system, and predicting such binding has valuable applications in vaccine design, the diagnosis and treatment of cancer, etc. Recent work has demonstrated that common information exists between related molecules (alleles) and can be leveraged to improve peptide-MHCI binding prediction.
We used a binary-labeled MHCI dataset consisting of 18664 peptide sequences for 12 human MHCI molecules. Each peptide sequence was converted to a 400-dimensional feature vector following [35]. The goal is to determine whether a peptide sequence (instance) binds to an MHCI molecule (task) or not, i.e., binder or non-binder.
We report the average cumulative error rate and F1-measure over the 12 tasks in Table IV. To make a clear comparison between the proposed ROMCO-NuCl/LogD and the baselines, Fig. 4 shows the variation of the cumulative error rate along the entire online learning process, averaged over the 10 runs.
From these results, we first observe that the permutations of the dataset have little influence on the performance of each method, as indicated by the small standard deviations in Table IV. Note that the majority of the dataset belongs to the negative class; predicting more examples as the majority class thus decreases the overall error rate, but also degrades the accuracy on the minority positive class. The consistently good performance achieved by the proposed ROMCO-NuCl/LogD in terms of the error rate and the F1-measures of both classes further demonstrates the effectiveness of our algorithms on imbalanced datasets. Moreover, among the online models, learning related tasks jointly still achieves better performance than learning the tasks individually, as shown by the improvement of the ROMCO and OMTL models over the PA-Unique model.
Algorithm     Error Rate     Positive Class F1-measure   Negative Class F1-measure
MTFL          27.51 (12.25)  79.18 (12.87)               36.06 (14.85)
TRML          26.58 (11.82)  79.89 (12.49)               37.64 (15.05)
PA-Global     31.80 (5.87)   74.43 (8.61)                47.96 (14.47)
PA-Unique     19.68 (7.39)   82.97 (9.35)                57.80 (21.05)
CW-COL        25.45 (6.96)   78.89 (9.30)                53.95 (16.71)
OMTL          19.44 (7.28)   83.18 (9.29)                57.77 (21.39)
OSMTL-e       20.73 (7.15)   82.31 (9.04)                58.76 (18.71)
OMTL-VON      18.61 (7.29)   84.45 (8.64)                55.92 (23.94)
OMTL-LOG      18.61 (7.29)   84.45 (8.64)                55.92 (23.94)
ROMCO-NuCl    19.14 (7.20)   83.46 (9.25)                58.27 (21.17)
ROMCO-LogD    18.21 (6.71)   84.63 (8.41)                55.53 (25.02)
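As a note on the metric: the cumulative error rate reported above is, after each round, the fraction of mistakes made on all rounds so far. A minimal sketch follows, with placeholder labels and predictions; in the experiments these come from the online model being updated round by round.

```python
# Running (cumulative) error rate of an online learner: after round t, the
# fraction of mistakes made on rounds 1..t. Labels and predictions below are
# placeholders for illustration.
def cumulative_error_rate(y_true, y_pred):
    """Return the running error rate after every round."""
    rates, mistakes = [], 0
    for t, (y, p) in enumerate(zip(y_true, y_pred), start=1):
        mistakes += int(y != p)
        rates.append(mistakes / t)
    return rates

rates = cumulative_error_rate([1, -1, 1, 1], [1, 1, 1, -1])
# rates traces the curve plotted in Fig. 4; rates[-1] is the final error rate
```

Plotting such a sequence for each method, averaged over runs, yields curves like those in Fig. 4.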
6.4 Movie Recommender System
In recent years, recommender systems have achieved great success in many real-world applications. The goal is to predict users' preferences on targeted products, i.e., given partially observed user-movie rating entries in the underlying (ratings) matrix, we would like to infer users' preferences for unrated movies.
We used a dataset hosted by the DEC System Research Center that collected the EachMovie recommendation data over 18 months. During that time, 72916 users entered a total of 2811983 numeric ratings for 1628 different movies. We randomly selected 30 users (tasks) who viewed exactly 200 movies, using their ratings as the target classes. For each of the 30 users, we then randomly selected 1783 users who viewed the same 200 movies and used their ratings as the features of the movies. The six possible ratings were converted into binary classes (i.e., like or dislike) based on the rating order.
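The conversion of ordinal ratings to binary classes can be sketched as below. The 0-5 scale and the midpoint threshold are assumptions for illustration, standing in for the rating order used in the experiment.

```python
# Sketch of converting ordinal ratings to binary like/dislike labels by rating
# order: ratings in the upper half of the scale map to +1 (like), the rest to
# -1 (dislike). The six-level 0-5 scale and the threshold are assumptions.
def binarize(rating, scale=(0, 1, 2, 3, 4, 5)):
    """Map a rating on an ordinal scale to a binary class label."""
    threshold = scale[len(scale) // 2]  # first rating of the upper half
    return 1 if rating >= threshold else -1

labels = [binarize(r) for r in (0, 2, 3, 5)]  # -> [-1, -1, 1, 1]
```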
Table V shows the comparison results in terms of the average cumulative error rate and F1-measure. Fig. 5 depicts the cumulative error rate in detail along the entire online learning process, averaged over the 30 tasks of the EachMovie dataset. From these results, we can draw several conclusions.
First, it can be seen that the proposed ROMCO-NuCl/LogD outperform the other baselines: they always provide smaller error rates and higher F1-measures, showing that our algorithms maintain high prediction accuracy. We believe this promising result is due to two reasons. First, the correlative and personalized patterns are effective at discovering task relatedness and personalized tasks, and these patterns are successfully captured in all three real-world datasets. Second, once an error occurs on at least one task, ROMCO-NuCl/LogD update the entire task matrix; this benefits related tasks with few learning instances, since the shared subspaces are updated accordingly.
Next, we observed in Fig. 5 that ROMCO-LogD is better than ROMCO-NuCl in terms of the error rate and F1-measure. This is expected: compared to the nuclear norm, the log-determinant function achieves a better rank approximation, i.e., it reduces the contribution of large singular values while shrinking the small singular values toward zero.
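This behavior can be illustrated numerically. In the sketch below, the log-determinant surrogate is taken in the common form of summing log(1 + s/theta) over singular values s, with a hypothetical smoothing parameter theta; this is an illustration of the effect rather than the exact objective of the algorithm.

```python
import numpy as np

# Contrast the nuclear norm (sum of singular values) with a log-determinant
# rank surrogate on the same matrix. theta is a hypothetical smoothing
# parameter; log(1 + s/theta) grows far more slowly than s for large s, so
# large singular values contribute much less to the surrogate.
def nuclear_norm(W):
    return np.linalg.svd(W, compute_uv=False).sum()

def logdet_surrogate(W, theta=1.0):
    s = np.linalg.svd(W, compute_uv=False)
    return np.log(1.0 + s / theta).sum()

W = np.diag([10.0, 1.0, 0.1])
# The nuclear norm counts the large singular value 10 in full, while the
# log-det surrogate counts only log(11), so it is a tighter rank proxy here.
```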
6.5 Effect of the Regularization Parameters
We used the Spam Email and Human MHC-I datasets as the cases for parameter sensitivity analysis. On the Spam Email dataset, we fixed one of the two regularization parameters and varied the other over its tuning set to study how each parameter affects the classification performance of ROMCO-NuCl/LogD; the same protocol was applied to the pair of parameters on the Human MHC-I dataset. In Fig. 6, we show the classification performance of ROMCO in terms of the error rate for each parameter pair. From Fig. 6, we observed that the performance worsens as either parameter increases on the Spam Email dataset, which indicates weak relatedness among the tasks and many personalized tasks in the Email data. On Human MHC-I, poor performance is triggered by a small value of one parameter or a large value of the other: compared with the Email data, MHC-I contains fewer personalized tasks, while most tasks are closely related and well represented by a low-dimensional subspace.
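The sensitivity study follows a simple grid protocol, sketched below. The tuning sets and the evaluation function are hypothetical placeholders standing in for training and evaluating ROMCO at each parameter pair.

```python
# Grid protocol for the sensitivity study: evaluate the model at every pair of
# regularization weights, fixing one while varying the other. Tuning sets and
# evaluate() are placeholders for the actual values and training protocol.
def grid_search(tuning_set_1, tuning_set_2, evaluate):
    """Return the error rate for every (lambda1, lambda2) pair on the grid."""
    return {(l1, l2): evaluate(l1, l2)
            for l1 in tuning_set_1 for l2 in tuning_set_2}

# Hypothetical usage with a dummy evaluation function in place of training:
results = grid_search([0.01, 0.1, 1.0], [0.01, 0.1, 1.0],
                      evaluate=lambda l1, l2: l1 + l2)
best = min(results, key=results.get)  # pair with the lowest error rate
```

Plotting `results` over the grid produces a surface like the ones shown in Fig. 6.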
7 Conclusion
We proposed an online MTL method that can identify sparse personalized patterns for outlier tasks while capturing a shared low-rank subspace for correlative tasks. As an online technique, the algorithm achieves a low prediction error rate by leveraging previously learned knowledge. As a multi-task approach, it balances the trade-off between the shared and personalized structures by learning all the tasks jointly. In addition, we proposed a log-determinant function to approximate the rank of the matrix, which in turn achieves better performance than the nuclear norm. We showed that the algorithm achieves a sublinear regret bound with respect to the best linear model in hindsight, which provides theoretical support for the proposed method. Meanwhile, the empirical results demonstrate that our algorithms outperform other state-of-the-art techniques on three real-world applications. In future work, online active learning could be applied in the MTL scenario in order to save labeling cost.
Appendix
Root Computation of the Log-Determinant Function
To solve the log-determinant problem, we set the derivative of Eq. (20) with respect to each singular value to zero,
which, after collecting terms, reduces to a cubic equation in each singular value, with coefficients determined by the quantities in Eq. (20).
The three possible roots of this cubic equation comprise one real root and a pair of complex-conjugate roots; the real root gives the updated singular value.
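For illustration, the real root can be extracted numerically as below. The coefficients here are hypothetical stand-ins for the ones derived from Eq. (20).

```python
import numpy as np

# Sketch of the root computation: a cubic with one real root and a pair of
# complex-conjugate roots, solved numerically. The coefficients are
# hypothetical; in the algorithm they come from the derivative of the
# log-determinant objective in Eq. (20).
def real_root_of_cubic(a, b, c, d):
    """Return the (single) real root of a*x^3 + b*x^2 + c*x + d = 0."""
    roots = np.roots([a, b, c, d])          # all three roots, possibly complex
    i = int(np.argmin(np.abs(roots.imag)))  # pick the root with ~zero imag part
    return float(roots[i].real)

x = real_root_of_cubic(1.0, 0.0, 1.0, -2.0)  # x^3 + x - 2 = 0 has real root 1
```

In the algorithm, this real root (thresholded at zero) would serve as the closed-form update of each singular value.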