Lu Bai¹, Yew-Soon Ong¹, Tiantian He¹, Abhishek Gupta²
¹ School of Computer Science and Engineering, Nanyang Technological University, Singapore
² Singapore Institute of Manufacturing Technology (SIMTech), A*STAR, Singapore
###### Abstract

Multi-label learning studies the problem where an instance is associated with a set of labels. By treating each single-label learning problem as one task, the multi-label learning problem can be cast as solving multiple related tasks simultaneously. In this paper, we propose a novel Multi-task Gradient Descent (MGD) algorithm to solve a group of related tasks simultaneously. In the proposed algorithm, each task minimizes its individual cost function using reformative gradient descent, where the relations among the tasks are exploited by transferring model parameter values across tasks. Theoretical analysis shows that the proposed algorithm converges under a proper transfer mechanism. Compared with existing approaches, MGD is easy to implement, places fewer requirements on the training model, supports asymmetric transfer so that negative transfer is mitigated, and can benefit from parallel computing when the number of tasks is large. Competitive experimental results on multi-label learning datasets validate the effectiveness of the proposed algorithm.

## I Introduction

Multi-label learning deals with the problem in which one instance is associated with multiple labels; for example, a news document can be labeled as sports, Olympics, and ticket sales [1]. Formally, let $\mathcal{X}=\mathbb{R}^{d}$ denote the $d$-dimensional feature space and $\mathcal{Y}$ denote the label space with $T$ class labels. Given the multi-label training set $\mathcal{D}=\{(\mathbf{x}_j, Y_j)\}_{j=1}^{n}$, where $n$ is the number of instances, $\mathbf{x}_j\in\mathcal{X}$ is the feature vector of the $j$-th instance, and $Y_j\subseteq\mathcal{Y}$ is the set of labels associated with the $j$-th instance, the task of multi-label learning is to learn a function from $\mathcal{D}$ that assigns a set of proper labels to an instance.

One straightforward method to solve the multi-label learning problem is to decompose it into a set of independent binary classification problems [2]. This strategy is easy to implement, and existing single-label classification approaches, e.g., logistic regression and SVM, can be used directly. However, as the news document example shows, an instance with the Olympics label is very likely to also carry the sports label. The correlations among the labels may provide useful information for one another and help to improve the performance of multi-label learning [1, 3].

Over the past years, many methods have been proposed to improve the performance of multi-label learning by exploring label correlations. Methods such as classifier chains [4], calibrated label ranking [5], and random $k$-labelsets [6] usually have high complexity when the number of class labels is large. The works in [7, 8, 9] take the label correlations as prior knowledge and incorporate them into the model training. The works in [10, 11] exploit label correlations by learning a latent label representation and optimizing label manifolds. The work in [12] explores the correlations by solving an optimization problem that models the contribution of related labels, and then incorporates the learned correlations into the model training. In these existing approaches, a well-designed training model is required to achieve notable performance.

The rest of the paper is organized as follows. First, previous works related to multi-label learning are reviewed. Second, we introduce the proposed MGD and provide the theoretical analysis, including model convergence and computational complexity. Third, we present how the proposed MGD is extensively tested on real multi-label learning datasets and compared with strong baselines. Finally, we summarize the proposed approach and the contributions of the paper.

## II Related Work

Based on the order of information being considered, existing multi-label learning approaches can be roughly categorized into three major types [1]. First-order methods, such as BR [2] and LIFT [14], ignore the label correlations and handle the multi-label learning problem in a label-by-label manner. Second-order methods, such as LLSF [8] and JFSC [9], consider pairwise relations between labels. High-order methods, such as RAkEL [6], ECC [4], LLSF-DL [8], and CAMEL [12], consider high-order relations among label subsets or all the labels. Generally, the higher the order of correlations being considered, the stronger the correlation-modeling capability, but also the more computationally demanding and less scalable the approach becomes.

Treating a single-label learning problem as one task, the multi-label learning problem can be seen as a special case of the multi-task learning problem, where the feature vectors are the same for different tasks. In the majority of multi-task learning methods, the relations among the tasks are promoted through regularization in an overall objective function composed of all the tasks’ parameters, as in feature-based approaches [15, 16, 17, 18, 19] and task-relation-based approaches [20, 21, 22]. Specifically, for the second-order multi-label learning approaches in [7, 8, 9], the label correlation matrix, which is taken as prior knowledge obtained from the similarity between label vectors, is often incorporated as a structured norm regularization term that regulates the learning hypotheses or performs label-specific feature selection and model training.

In contrast to the existing multi-label and multi-task learning approaches which incorporate correlation information into the model training process in the form of regularization, MGD serves as the first attempt to incorporate the correlations by transferring model parameter values during the optimization process of each task, i.e., when minimizing its individual cost function.

## III The MGD Approach

In this section, we elaborate on the proposed MGD algorithm for multi-label learning. We first introduce the mathematical notation used in the manuscript. We then generically formulate the multi-label learning problem and show how MGD solves it via the reformative gradient descent, in which the correlated parameters are transferred across multiple tasks. Finally, we give the theoretical analysis of MGD, including the convergence proof and the computational complexity.

Throughout this paper, normal-font lowercase letters denote scalars, boldface lowercase letters denote column vectors, and capital letters denote matrices. $\mathbf{0}$ denotes the zero column vector of proper dimension, and $I_d$ denotes the identity matrix of size $d\times d$. $A^{T}$ denotes the transpose of matrix $A$ and $\otimes$ denotes the Kronecker product. $\mathrm{col}\{\mathbf{z}_1,\dots,\mathbf{z}_n\}$ denotes a concatenated column vector formed by stacking $\mathbf{z}_1,\dots,\mathbf{z}_n$ on top of each other, and $\mathrm{diag}\{z_1,\dots,z_n\}$ denotes a diagonal matrix with the $i$-th diagonal element being $z_i$. A norm without a subscript is the Euclidean norm. Following the notation used in the Introduction, we alternatively represent the training set as $\mathcal{D}=\{X, Y\}$, where $X$ denotes the instance matrix and $Y\in\{0,1\}^{n\times T}$ denotes the label matrix. In addition, we denote the training set for label $i$ as $\mathcal{D}_i=\{X,\mathbf{y}_i\}$, where $\mathbf{y}_i$ is the $i$-th column vector of the label matrix $Y$.

### III-A Problem Formulation

Treating each single-label learning problem as one task, we have $T$ tasks to be solved simultaneously. Each task aims to minimize its own cost function

$$\min_{\mathbf{w}_i} f_i(\mathbf{w}_i), \tag{1}$$

where $\mathbf{w}_i\in\mathbb{R}^{d}$ is the model parameter and $f_i$ is the cost function of the $i$-th task with training dataset $\mathcal{D}_i$. In this paper, we do not restrict the specific form of the cost functions. We only assume that each cost function $f_i$ is strongly convex and twice differentiable, and that the gradient of $f_i$ is Lipschitz continuous with constant $L_{f_i}$, i.e.,

$$\|\nabla f_i(\mathbf{u})-\nabla f_i(\mathbf{v})\|\leq L_{f_i}\|\mathbf{u}-\mathbf{v}\|,\quad \forall \mathbf{u},\mathbf{v}\in\mathbb{R}^{d}.$$

Cost functions such as the mean squared error with $\ell_2$-norm regularization and the cross-entropy with $\ell_2$-norm regularization satisfy these assumptions. Non-differentiable cost functions that use $\ell_1$-norm regularization can also be considered through approximation [23]. Since $f_i$ is strongly convex and twice differentiable, there exists a positive constant $\xi_i$ such that $\nabla^{2}f_i(\mathbf{u})\succeq\xi_i I_d$. As a result, we have

$$\xi_i I_d\preceq\nabla^{2}f_i(\mathbf{u})\preceq L_{f_i}I_d,\quad \forall \mathbf{u}\in\mathbb{R}^{d}.$$

### III-B The Proposed Framework

Equation (1) is solved using the gradient descent iteration,

$$\mathbf{w}_i^{t+1}=\mathbf{w}_i^{t}-\alpha\nabla f_i(\mathbf{w}_i^{t}), \tag{2}$$

where $t$ is the iteration index, $\alpha$ is the step size, and $\nabla f_i(\mathbf{w}_i^{t})$ is the gradient of $f_i$ at $\mathbf{w}_i^{t}$. As there are relations among tasks, we can improve the learning performance by considering the correlation of parameters belonging to different tasks. Based on this idea, we propose a reformative gradient descent iteration, which allows the values of the model parameters during each iteration to be transferred across similar tasks. The MGD iteration is designed as follows,

$$\mathbf{w}_i^{t+1}=\sum_{j=1}^{T}m_{ij}^{t}\mathbf{w}_j^{t}-\alpha\nabla f_i(\mathbf{w}_i^{t}),\quad i=1,\dots,T, \tag{3}$$

where $m_{ij}^{t}$ is the transfer coefficient describing the information flow from task $j$ to task $i$, which satisfies the following conditions,

$$m_{ij}^{t}\geq 0, \tag{4a}$$

$$\sum_{j=1}^{T}m_{ij}^{t}=1. \tag{4b}$$
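As a minimal sketch of iteration (3), the update for all tasks can be written as one matrix product when the parameters are stacked row-wise. The function name `mgd_step`, the toy quadratic costs, and the particular coefficient matrix are illustrative assumptions, not part of the paper.

```python
import numpy as np

def mgd_step(W, M, grads, alpha):
    """One MGD update (3): w_i <- sum_j m_ij * w_j - alpha * grad f_i(w_i).

    W: (T, d) array, row i is w_i^t.
    M: (T, T) array of transfer coefficients satisfying (4a)-(4b).
    grads: (T, d) array, row i is grad f_i evaluated at w_i^t.
    """
    assert np.all(M >= 0) and np.allclose(M.sum(axis=1), 1.0)
    return M @ W - alpha * grads

# Hypothetical toy costs: f_i(w) = 0.5 * ||w - c_i||^2, so grad f_i(w) = w - c_i.
rng = np.random.default_rng(0)
T, d = 3, 4
C = rng.normal(size=(T, d))
M = np.array([[0.90, 0.10, 0.00],
              [0.05, 0.90, 0.05],
              [0.00, 0.10, 0.90]])   # asymmetric transfer is allowed
W = np.zeros((T, d))
for _ in range(200):
    W = mgd_step(W, M, W - C, alpha=0.1)
```

With `M` equal to the identity matrix the update reduces to the plain gradient descent iteration (2), which is a quick sanity check for any implementation.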

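From (4b), we have $m_{ii}^{t}=1-\sum_{j\neq i}m_{ij}^{t}$. Iteration (3) can therefore be rewritten as

$$\begin{aligned}
\mathbf{w}_i^{t+1}&=m_{ii}^{t}\mathbf{w}_i^{t}+\sum_{j\neq i}m_{ij}^{t}\mathbf{w}_j^{t}-\alpha\nabla f_i(\mathbf{w}_i^{t})\\
&=\Big(1-\sum_{j\neq i}m_{ij}^{t}\Big)\mathbf{w}_i^{t}+\sum_{j\neq i}m_{ij}^{t}\mathbf{w}_j^{t}-\alpha\nabla f_i(\mathbf{w}_i^{t}).
\end{aligned}$$

The coefficient $m_{ij}^{t}$ can be rescaled as

$$\bar{m}_{ij}^{t}=\begin{cases}\dfrac{1}{\alpha\sigma}m_{ij}^{t}, & j\neq i,\\[6pt] 1-\dfrac{1}{\alpha\sigma}\sum_{j\neq i}m_{ij}^{t}, & j=i,\end{cases} \tag{5}$$

where $\sigma$ is a positive constant that satisfies the condition

$$1-\frac{1}{\alpha\sigma}\sum_{j\neq i}m_{ij}^{t}>0. \tag{6}$$

Given (5), $m_{ij}^{t}$ is parameterized by $\bar{m}_{ij}^{t}$. With the rescaling, the iteration in (3) can be alternatively expressed as

$$\begin{aligned}
\mathbf{w}_i^{t+1}&=\Big(1-\alpha\sigma\sum_{j\neq i}\bar{m}_{ij}^{t}\Big)\mathbf{w}_i^{t}+\alpha\sigma\sum_{j\neq i}\bar{m}_{ij}^{t}\mathbf{w}_j^{t}-\alpha\nabla f_i(\mathbf{w}_i^{t})\\
&=\mathbf{w}_i^{t}-\alpha\sigma(1-\bar{m}_{ii}^{t})\mathbf{w}_i^{t}+\alpha\sigma\sum_{j\neq i}\bar{m}_{ij}^{t}\mathbf{w}_j^{t}-\alpha\nabla f_i(\mathbf{w}_i^{t})\\
&=(1-\alpha\sigma)\mathbf{w}_i^{t}+\alpha\sigma\sum_{j=1}^{T}\bar{m}_{ij}^{t}\mathbf{w}_j^{t}-\alpha\nabla f_i(\mathbf{w}_i^{t}).
\end{aligned} \tag{7}$$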
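The equivalence between (3) and (7) can be checked numerically. The sketch below assumes a hypothetical helper `rescale` that implements (5), and uses random vectors as stand-ins for the parameters and gradients; the values of `alpha` and `sigma` are chosen only so that condition (6) holds.

```python
import numpy as np

def rescale(M, alpha, sigma):
    """Rescaling (5): divide off-diagonal entries by alpha*sigma, then set
    each diagonal entry so that the row sums to one."""
    M_bar = M / (alpha * sigma)
    np.fill_diagonal(M_bar, 0.0)
    np.fill_diagonal(M_bar, 1.0 - M_bar.sum(axis=1))
    return M_bar

alpha, sigma = 0.1, 2.0
M = np.array([[0.90, 0.10, 0.00],
              [0.05, 0.90, 0.05],
              [0.00, 0.10, 0.90]])
M_bar = rescale(M, alpha, sigma)   # diagonal stays positive, so (6) holds

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))        # rows are w_i^t
G = rng.normal(size=(3, 4))        # rows stand in for grad f_i(w_i^t)

step_3 = M @ W - alpha * G                                                   # form (3)
step_7 = (1 - alpha * sigma) * W + alpha * sigma * (M_bar @ W) - alpha * G   # form (7)
```

Both forms produce the same next iterate; the rescaled form only separates the step size from the strength of the coupling, which is what the convergence analysis relies on.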

### III-C Convergence Analysis

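In this section, we give the convergence property of the proposed MGD iteration based on the expression in (7).

Denote $\mathbf{w}_i^{*}$ as the best coefficient of the label predictor for task $i$, $\tilde{\mathbf{w}}_i^{t}=\mathbf{w}_i^{t}-\mathbf{w}_i^{*}$, and $\bar{L}_{f_i}=\max_i L_{f_i}$. The following theorem gives the convergence property of the iteration (7) under certain conditions on the step-size parameter $\alpha$.

###### Theorem 1.

Under the iteration in (7) with transfer coefficients satisfying

$$\sum_{j=1}^{T}\bar{m}_{ij}^{t}=1,\ \forall i,\qquad \bar{m}_{ij}^{t}\geq 0,\ \forall i,j,$$

$\mathbf{w}_i^{t}$ is convergent if the step size $\alpha$ is chosen to satisfy

$$0<\alpha<\frac{2}{2\sigma+\bar{L}_{f_i}}. \tag{8}$$

Specifically,

$$\lim_{t\to\infty}\max_i\|\tilde{\mathbf{w}}_i^{t}\|\leq\frac{2\alpha\sigma\max_i\|\mathbf{w}_i^{*}\|+\alpha\max_i\|\nabla f_i(\mathbf{w}_i^{*})\|}{1-(\bar{\gamma}+\alpha\sigma)}, \tag{9}$$

where $\bar{\gamma}=\max_i\max\{|1-\alpha\sigma-\alpha\xi_i|,\ |1-\alpha\sigma-\alpha L_{f_i}|\}$.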

###### Proof.

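Let the $(i,j)$-th element of $\bar{M}^{t}\in\mathbb{R}^{T\times T}$ at iteration $t$ be $\bar{m}_{ij}^{t}$, and denote $\mathrm{w}^{t}=\mathrm{col}\{\mathbf{w}_1^{t},\dots,\mathbf{w}_T^{t}\}$, $\nabla f(\mathrm{w}^{t})=\mathrm{col}\{\nabla f_1(\mathbf{w}_1^{t}),\dots,\nabla f_T(\mathbf{w}_T^{t})\}$, and $\bar{\mathcal{M}}^{t}=\bar{M}^{t}\otimes I_d$. Note that we are using the typeface $\mathrm{w}$ to distinguish this from the single vector-valued variable $\mathbf{w}$. Writing (7) in concatenated form gives

$$\mathrm{w}^{t+1}=(1-\alpha\sigma)\mathrm{w}^{t}+\alpha\sigma\bar{\mathcal{M}}^{t}\mathrm{w}^{t}-\alpha\nabla f(\mathrm{w}^{t}). \tag{10}$$

Denote $\mathrm{w}^{*}=\mathrm{col}\{\mathbf{w}_1^{*},\dots,\mathbf{w}_T^{*}\}$ and $\tilde{\mathrm{w}}^{t}=\mathrm{w}^{t}-\mathrm{w}^{*}$. Subtracting $\mathrm{w}^{*}$ from both sides of (10) gives

$$\begin{aligned}
\tilde{\mathrm{w}}^{t+1}&=\big((1-\alpha\sigma)I_{dT}+\alpha\sigma\bar{\mathcal{M}}^{t}\big)\mathrm{w}^{t}-\mathrm{w}^{*}-\alpha\nabla f(\mathrm{w}^{t})\\
&=\big((1-\alpha\sigma)I_{dT}+\alpha\sigma\bar{\mathcal{M}}^{t}\big)\tilde{\mathrm{w}}^{t}-\alpha\big(\nabla f(\mathrm{w}^{t})-\nabla f(\mathrm{w}^{*})\big)+\alpha\big(\sigma(\bar{\mathcal{M}}^{t}-I_{dT})\mathrm{w}^{*}-\nabla f(\mathrm{w}^{*})\big)\qquad (11)\\
&=\big((1-\alpha\sigma)I_{dT}+\alpha\sigma\bar{\mathcal{M}}^{t}\big)\tilde{\mathrm{w}}^{t}-\alpha\int_{0}^{1}\nabla^{2}f\big(\mathrm{w}^{*}+\mu(\mathrm{w}^{t}-\mathrm{w}^{*})\big)\,d\mu\,\tilde{\mathrm{w}}^{t}+\alpha\big(\sigma(\bar{\mathcal{M}}^{t}-I_{dT})\mathrm{w}^{*}-\nabla f(\mathrm{w}^{*})\big)\\
&=\big((1-\alpha\sigma)I_{dT}+\alpha\sigma\bar{\mathcal{M}}^{t}-\alpha H^{t}\big)\tilde{\mathrm{w}}^{t}+\alpha\big(\sigma(\bar{\mathcal{M}}^{t}-I_{dT})\mathrm{w}^{*}-\nabla f(\mathrm{w}^{*})\big),\qquad (12)
\end{aligned}$$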

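where $H^{t}=\int_{0}^{1}\nabla^{2}f\big(\mathrm{w}^{*}+\mu(\mathrm{w}^{t}-\mathrm{w}^{*})\big)\,d\mu$. It can be verified that $H^{t}$ is a block diagonal matrix and that its diagonal blocks $H_i^{t}=\int_{0}^{1}\nabla^{2}f_i\big(\mathbf{w}_i^{*}+\mu(\mathbf{w}_i^{t}-\mathbf{w}_i^{*})\big)\,d\mu$, $i=1,\dots,T$, are Hermitian. We use the block maximum norm defined in [24] to show the convergence of the above iteration. The block maximum norm of a vector $\mathrm{x}=\mathrm{col}\{\mathbf{x}_1,\dots,\mathbf{x}_T\}$ with $\mathbf{x}_i\in\mathbb{R}^{d}$ is defined as [24]

$$\|\mathrm{x}\|_{b,\infty}=\max_i\|\mathbf{x}_i\|.$$

The induced matrix block maximum norm is therefore defined as [24]

$$\|A\|_{b,\infty}=\max_{\mathrm{x}\neq 0}\frac{\|A\mathrm{x}\|_{b,\infty}}{\|\mathrm{x}\|_{b,\infty}}.$$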

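From the iteration in (12) we have

$$\begin{aligned}
\|\tilde{\mathrm{w}}^{t+1}\|_{b,\infty}&\leq\big\|\big((1-\alpha\sigma)I_{dT}+\alpha\sigma\bar{\mathcal{M}}^{t}-\alpha H^{t}\big)\tilde{\mathrm{w}}^{t}\big\|_{b,\infty}+\alpha\big\|\sigma(\bar{\mathcal{M}}^{t}-I_{dT})\mathrm{w}^{*}-\nabla f(\mathrm{w}^{*})\big\|_{b,\infty}\\
&\leq\big\|(1-\alpha\sigma)I_{dT}+\alpha\sigma\bar{\mathcal{M}}^{t}-\alpha H^{t}\big\|_{b,\infty}\|\tilde{\mathrm{w}}^{t}\|_{b,\infty}+\alpha\big\|\sigma(\bar{\mathcal{M}}^{t}-I_{dT})\mathrm{w}^{*}-\nabla f(\mathrm{w}^{*})\big\|_{b,\infty}\\
&\leq\big(\|(1-\alpha\sigma)I_{dT}-\alpha H^{t}\|_{b,\infty}+\alpha\sigma\|\bar{\mathcal{M}}^{t}\|_{b,\infty}\big)\|\tilde{\mathrm{w}}^{t}\|_{b,\infty}+\alpha\big\|\sigma(\bar{\mathcal{M}}^{t}-I_{dT})\mathrm{w}^{*}-\nabla f(\mathrm{w}^{*})\big\|_{b,\infty}.
\end{aligned}$$

From Lemma D.3 in [24], we have

$$\|\bar{\mathcal{M}}^{t}\|_{b,\infty}=\|\bar{M}^{t}\|_{\infty}=1,$$

where the first equality uses $\bar{\mathcal{M}}^{t}=\bar{M}^{t}\otimes I_d$ and the last equality comes from the fact that $\bar{m}_{ij}^{t}\geq 0$ and each row of $\bar{M}^{t}$ sums to one. Since $\xi_i I_d\preceq\nabla^{2}f_i(\mathbf{u})\preceq L_{f_i}I_d$, we have $\xi_i I_d\preceq H_i^{t}\preceq L_{f_i}I_d$. Thus, $\|(1-\alpha\sigma)I_d-\alpha H_i^{t}\|\leq\gamma_i$, where $\gamma_i=\max\{|1-\alpha\sigma-\alpha\xi_i|,\ |1-\alpha\sigma-\alpha L_{f_i}|\}$. By the definition of the induced matrix block maximum norm, and using $\|\mathbf{x}_i\|\leq\|\mathrm{x}\|_{b,\infty}$, we have

$$\begin{aligned}
\|(1-\alpha\sigma)I_{dT}-\alpha H^{t}\|_{b,\infty}&=\max_{\mathrm{x}\neq 0}\frac{\|((1-\alpha\sigma)I_{dT}-\alpha H^{t})\mathrm{x}\|_{b,\infty}}{\|\mathrm{x}\|_{b,\infty}}\\
&=\max_{\mathrm{x}\neq 0}\frac{\max_i\|((1-\alpha\sigma)I_d-\alpha H_i^{t})\mathbf{x}_i\|}{\|\mathrm{x}\|_{b,\infty}}\\
&\leq\max_{\mathrm{x}\neq 0}\frac{\max_i\|(1-\alpha\sigma)I_d-\alpha H_i^{t}\|\,\|\mathrm{x}\|_{b,\infty}}{\|\mathrm{x}\|_{b,\infty}}\\
&=\max_i\|(1-\alpha\sigma)I_d-\alpha H_i^{t}\|\leq\bar{\gamma},
\end{aligned}$$

where $\bar{\gamma}=\max_i\gamma_i$. Thus,

$$\begin{aligned}
\|\tilde{\mathrm{w}}^{t+1}\|_{b,\infty}&\leq(\bar{\gamma}+\alpha\sigma)\|\tilde{\mathrm{w}}^{t}\|_{b,\infty}+\alpha\big\|\sigma(\bar{\mathcal{M}}^{t}-I_{dT})\mathrm{w}^{*}-\nabla f(\mathrm{w}^{*})\big\|_{b,\infty}\\
&\leq(\bar{\gamma}+\alpha\sigma)\|\tilde{\mathrm{w}}^{t}\|_{b,\infty}+2\alpha\sigma\|\mathrm{w}^{*}\|_{b,\infty}+\alpha\|\nabla f(\mathrm{w}^{*})\|_{b,\infty}.
\end{aligned} \tag{13}$$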

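By choosing the step size to satisfy $\bar{\gamma}+\alpha\sigma<1$, the iteration asymptotically converges. To ensure $\bar{\gamma}+\alpha\sigma<1$, it is sufficient to ensure

$$|1-\alpha\sigma-\alpha\xi_i|+\alpha\sigma<1\quad\text{and}\quad|1-\alpha\sigma-\alpha L_{f_i}|+\alpha\sigma<1,\quad\forall i,$$

which gives

$$0<\alpha<\frac{2}{2\sigma+\bar{L}_{f_i}}.$$

From the iteration in (13), we have

$$\|\tilde{\mathrm{w}}^{t+1}\|_{b,\infty}\leq(\bar{\gamma}+\alpha\sigma)^{t+1}\|\tilde{\mathrm{w}}^{0}\|_{b,\infty}+\big(2\alpha\sigma\|\mathrm{w}^{*}\|_{b,\infty}+\alpha\|\nabla f(\mathrm{w}^{*})\|_{b,\infty}\big)\sum_{k=0}^{t}(\bar{\gamma}+\alpha\sigma)^{k}.$$

Under the condition that $\bar{\gamma}+\alpha\sigma<1$,

$$\lim_{t\to\infty}\|\tilde{\mathrm{w}}^{t}\|_{b,\infty}\leq\frac{2\alpha\sigma\|\mathrm{w}^{*}\|_{b,\infty}+\alpha\|\nabla f(\mathrm{w}^{*})\|_{b,\infty}}{1-(\bar{\gamma}+\alpha\sigma)}.$$

From the definition of the block maximum norm, (9) is obtained. ∎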
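The step-size condition (8) and the bound (9) can be illustrated on a small synthetic problem. The sketch below is only a numerical sanity check under stated assumptions: the costs are hypothetical diagonal quadratics, $\mathbf{w}_i^{*}$ is taken to be the minimizer of $f_i$ (so the gradient term in (9) vanishes), and the transfer matrix is random and held fixed.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, sigma = 3, 2, 0.5

# Hypothetical costs: f_i(w) = 0.5 * (w - c_i)^T A_i (w - c_i), so w_i^* = c_i.
A = [np.diag(rng.uniform(1.0, 3.0, size=d)) for _ in range(T)]
C = rng.normal(size=(T, d))
xi = np.array([np.linalg.eigvalsh(a).min() for a in A])   # strong convexity constants
L = np.array([np.linalg.eigvalsh(a).max() for a in A])    # gradient Lipschitz constants

alpha = 0.9 * 2.0 / (2.0 * sigma + L.max())               # inside the range in (8)
gamma_bar = np.max(np.maximum(np.abs(1 - alpha * sigma - alpha * xi),
                              np.abs(1 - alpha * sigma - alpha * L)))
rate = gamma_bar + alpha * sigma                          # contraction factor in (13)

M_bar = rng.uniform(size=(T, T))
M_bar /= M_bar.sum(axis=1, keepdims=True)                 # nonnegative, rows sum to one

W = np.zeros((T, d))
for _ in range(500):
    grads = np.stack([A[i] @ (W[i] - C[i]) for i in range(T)])
    W = (1 - alpha * sigma) * W + alpha * sigma * (M_bar @ W) - alpha * grads

error = np.linalg.norm(W - C, axis=1).max()               # max_i ||w_i^t - w_i^*||
# grad f_i(w_i^*) = 0 here, so only the first numerator term of (9) remains.
bound = 2 * alpha * sigma * np.linalg.norm(C, axis=1).max() / (1 - rate)
```

In this setup `rate` stays below one and `error` ends below `bound`, as (9) predicts. Note that the iterates do not converge to the individual minimizers $\mathbf{w}_i^{*}$ themselves; the transfer term keeps them at a bounded distance, which is exactly what the right-hand side of (9) quantifies.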

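In iteration (3), the transfer coefficient between task $i$ and task $j$ is a scalar. In the following, we consider the element-wise feature similarities between task $i$ and task $j$. The transfer coefficient between task $i$ and task $j$ is assumed to be a diagonal matrix $P_{ij}^{t}$ whose $k$-th diagonal element $P_{ij,k}^{t}$ is the transfer coefficient from the $k$-th element of $\mathbf{w}_j^{t}$ to the $k$-th element of $\mathbf{w}_i^{t}$. The MGD iteration in (3) then becomes

$$\mathbf{w}_i^{t+1}=\sum_{j=1}^{T}P_{ij}^{t}\mathbf{w}_j^{t}-\alpha\nabla f_i(\mathbf{w}_i^{t}), \tag{14}$$

where

$$\sum_{j=1}^{T}P_{ij}^{t}=I_d,\qquad P_{ij,k}^{t}\geq 0,\quad\forall i,j=1,\dots,T,\ k=1,\dots,d. \tag{15}$$

Following the same rescaling,

$$\bar{P}_{ij}^{t}=\begin{cases}\dfrac{1}{\alpha\sigma}P_{ij}^{t}, & j\neq i,\\[6pt] I_d-\dfrac{1}{\alpha\sigma}\sum_{j\neq i}P_{ij}^{t}, & j=i,\end{cases} \tag{16}$$

(14) becomes

$$\mathbf{w}_i^{t+1}=(1-\alpha\sigma)\mathbf{w}_i^{t}+\alpha\sigma\sum_{j=1}^{T}\bar{P}_{ij}^{t}\mathbf{w}_j^{t}-\alpha\nabla f_i(\mathbf{w}_i^{t}). \tag{17}$$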
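Because each $P_{ij}^{t}$ is diagonal, the element-wise iteration (14) can be sketched by storing only the diagonals in a three-dimensional array. The function name `mgd_step_elementwise` and the array layout are illustrative assumptions.

```python
import numpy as np

def mgd_step_elementwise(W, P, grads, alpha):
    """Iteration (14) with diagonal transfer matrices.

    W: (T, d) array, row j is w_j^t.
    P: (T, T, d) array, P[i, j] holds the diagonal of P_ij^t.
    grads: (T, d) array, row i is grad f_i(w_i^t).
    """
    assert np.all(P >= 0) and np.allclose(P.sum(axis=1), 1.0)   # condition (15)
    mixed = np.einsum("ijk,jk->ik", P, W)   # row i: sum_j P_ij^t w_j^t
    return mixed - alpha * grads

rng = np.random.default_rng(0)
T, d = 3, 4
W = rng.normal(size=(T, d))
G = rng.normal(size=(T, d))
M = rng.uniform(size=(T, T))
M /= M.sum(axis=1, keepdims=True)

# Using the same coefficient for every feature recovers the scalar iteration (3).
P_scalar = np.repeat(M[:, :, None], d, axis=2)
W_next = mgd_step_elementwise(W, P_scalar, G, alpha=0.1)
```

The scalar case is recovered when every feature shares the same coefficient, so `W_next` matches `M @ W - 0.1 * G` here.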
###### Corollary 1.

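Under (17) with transfer coefficients satisfying

$$\sum_{j=1}^{T}\bar{P}_{ij}^{t}=I_d,\qquad \bar{P}_{ij,k}^{t}\geq 0,\quad\forall i,j=1,\dots,T,\ k=1,\dots,d,$$

$\mathbf{w}_i^{t}$ is convergent if the following conditions are satisfied:

$$\sigma<\frac{\bar{L}_{f_i}}{T-1}\ \text{ for } T>1,\qquad 0<\alpha<\frac{2}{(T+1)\sigma+\bar{L}_{f_i}}.$$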
###### Proof.

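Let the $(i,j)$-th block element of $\bar{P}^{t}$ be $\bar{P}_{ij}^{t}$. Following a procedure similar to the proof of Theorem 1, we obtain

$$\|\tilde{\mathrm{w}}^{t+1}\|_{b,\infty}\leq\big(\|(1-\alpha\sigma)I_{dT}-\alpha H^{t}\|_{b,\infty}+\alpha\sigma\|\bar{P}^{t}\|_{b,\infty}\big)\|\tilde{\mathrm{w}}^{t}\|_{b,\infty}+\alpha\big\|\sigma(\bar{P}^{t}-I_{dT})\mathrm{w}^{*}-\nabla f(\mathrm{w}^{*})\big\|_{b,\infty}. \tag{18}$$

Let $\mathrm{x}=\mathrm{col}\{\mathbf{x}_1,\dots,\mathbf{x}_T\}$ be a block column vector with $\mathbf{x}_i\in\mathbb{R}^{d}$. Then

$$\|\bar{P}^{t}\mathrm{x}\|_{b,\infty}=\max_i\Big\|\sum_{j=1}^{T}\bar{P}_{ij}^{t}\mathbf{x}_j\Big\|\leq\max_i\sum_{j=1}^{T}\|\bar{P}_{ij}^{t}\|\,\|\mathbf{x}_j\|\leq\Big(\max_i\sum_{j=1}^{T}\|\bar{P}_{ij}^{t}\|\Big)\max_j\|\mathbf{x}_j\|.$$

Recall that $\bar{P}_{ij}^{t}$ is a diagonal matrix whose elements are all no greater than 1; thus, $\|\bar{P}_{ij}^{t}\|\leq 1$. As a result

$$\|\bar{P}^{t}\mathrm{x}\|_{b,\infty}\leq T\max_j\|\mathbf{x}_j\|.$$

By the definition of the matrix block maximum norm, we have

$$\|\bar{P}^{t}\|_{b,\infty}\leq T.$$

The condition to ensure convergence of the iteration in (18) becomes

$$\bar{\gamma}+\alpha\sigma T<1,$$

which gives

$$\sigma<\frac{\bar{L}_{f_i}}{T-1}\ \text{ for } T>1,\qquad 0<\alpha<\frac{2}{(T+1)\sigma+\bar{L}_{f_i}}.$$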

### III-D Relation with Multi-Task Learning

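From the iteration in (17), we have

$$\begin{aligned}
\mathbf{w}_i^{t+1}&=(1-\alpha\sigma)\mathbf{w}_i^{t}+\alpha\sigma\sum_{j=1}^{T}\bar{P}_{ij}^{t}\mathbf{w}_j^{t}-\alpha\nabla f_i(\mathbf{w}_i^{t})\\
&=\mathbf{w}_i^{t}-\alpha\Big(\sigma\sum_{j=1}^{T}\bar{P}_{ij}^{t}(\mathbf{w}_i^{t}-\mathbf{w}_j^{t})+\nabla f_i(\mathbf{w}_i^{t})\Big),
\end{aligned} \tag{19}$$

where the second equality uses $\sum_{j=1}^{T}\bar{P}_{ij}^{t}=I_d$. If we fix $\bar{P}_{ij}^{t}=\bar{P}_{ij}$ for all $t$, then the term in the brackets can be seen as the gradient with respect to $\mathbf{w}_i$ of the following function

$$\bar{f}_i(\mathbf{w}_i,\mathbf{w}_{-i})=f_i(\mathbf{w}_i)+\frac{1}{2}\sigma\sum_{j=1}^{T}(\mathbf{w}_i-\mathbf{w}_j)^{T}\bar{P}_{ij}(\mathbf{w}_i-\mathbf{w}_j),$$

where $\mathbf{w}_{-i}$ denotes the collection of the other tasks’ variables, i.e., $\mathbf{w}_{-i}=\mathrm{col}\{\mathbf{w}_1,\dots,\mathbf{w}_{i-1},\mathbf{w}_{i+1},\dots,\mathbf{w}_T\}$. Thus, the iteration in (19) with fixed $\bar{P}_{ij}$ can be seen as the gradient descent algorithm that solves the following Nash equilibrium problem

$$\min_{\mathbf{w}_i}\ \bar{f}_i(\mathbf{w}_i,\mathbf{w}_{-i}),\quad i=1,\dots,T. \tag{20}$$

In (20), each task’s objective function is influenced by the other tasks’ decision variables. Since the objective function $\bar{f}_i(\mathbf{w}_i,\mathbf{w}_{-i})$ is continuous in all its arguments, strongly convex with respect to $\mathbf{w}_i$ for fixed $\mathbf{w}_{-i}$, and satisfies $\bar{f}_i(\mathbf{w}_i,\mathbf{w}_{-i})\to\infty$ as $\|\mathbf{w}_i\|\to\infty$ for fixed $\mathbf{w}_{-i}$, a Nash equilibrium exists [25]. Furthermore, as a result of strong convexity, the gradient of $\bar{f}_i(\mathbf{w}_i,\mathbf{w}_{-i})$ with respect to $\mathbf{w}_i$ for fixed $\mathbf{w}_{-i}$ is strongly monotone. Thus, the Nash equilibrium for (20) is unique [26]. Denote the Nash equilibrium of (20) as $\mathbf{w}_i^{o}$, $i=1,\dots,T$. It is known that the Nash equilibrium satisfies the following condition [25]:

$$\mathbf{w}_i^{o}=\arg\min_{\mathbf{w}_i}\bar{f}_i(\mathbf{w}_i,\mathbf{w}_{-i}^{o}),\quad i=1,\dots,T,$$

which implies

$$\nabla f_i(\mathbf{w}_i^{o})+\sigma\sum_{j=1}^{T}\bar{P}_{ij}(\mathbf{w}_i^{o}-\mathbf{w}_j^{o})=0,\quad i=1,\dots,T.$$

Writing the conditions above in concatenated form gives

$$\nabla f(\mathrm{w}^{o})+\sigma(I_{dT}-\bar{P})\mathrm{w}^{o}=0. \tag{21}$$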
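The stationarity condition (21) can be checked numerically for the scalar-coefficient case, where $\bar{P}_{ij}=\bar{m}_{ij}I_d$. The sketch below assumes hypothetical quadratic costs, for which (21) is a linear system that can also be solved directly; all names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d, sigma, alpha = 3, 2, 0.5, 0.2
a = rng.uniform(1.0, 2.0, size=T)           # f_i(w) = 0.5 * a_i * ||w - c_i||^2
C = rng.normal(size=(T, d))
M_bar = rng.uniform(size=(T, T))
M_bar /= M_bar.sum(axis=1, keepdims=True)   # fixed, row-stochastic, asymmetric

def grad(W):
    """Row i is grad f_i(w_i) = a_i * (w_i - c_i)."""
    return a[:, None] * (W - C)

# Run iteration (17) with fixed coefficients until it settles.
W = np.zeros((T, d))
for _ in range(2000):
    W = (1 - alpha * sigma) * W + alpha * sigma * (M_bar @ W) - alpha * grad(W)

# Left-hand side of (21), arranged with one task per row.
residual = grad(W) + sigma * (W - M_bar @ W)

# For quadratic costs, (21) reads (diag(a) + sigma * (I - M_bar)) W = diag(a) C.
W_direct = np.linalg.solve(np.diag(a) + sigma * (np.eye(T) - M_bar), a[:, None] * C)
```

Here `residual` vanishes up to numerical precision and `W` agrees with `W_direct`, so the limit of the iteration is the equilibrium characterized by (21) even though `M_bar` is not symmetric.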

It has been pointed out in [27] that regularized multi-task learning algorithms can be classified into two main categories: learning with feature covariance and learning with task relations. The objective functions of the two categories are

$$\min\ \sum_{i=1}^{T}L_i(\mathbf{w}_i)+\frac{1}{2}\lambda\,\mathrm{w}^{T}(I_T\otimes\Theta^{-1})\mathrm{w}+g_1(\Theta) \tag{22}$$

and

$$\min\ \sum_{i=1}^{T}L_i(\mathbf{w}_i)+\frac{1}{2}\lambda\,\mathrm{w}^{T}(\Sigma^{-1}\otimes I_d)\mathrm{w}+g_2(\Sigma), \tag{23}$$

where $L_i$ is the training loss of task $i$, $\lambda$ is a positive regularization parameter, $\Theta$ models the covariance between the features, $\Sigma$ models the task relations, and $g_1(\Theta)$ and $g_2(\Sigma)$ denote constraints on $\Theta$ and $\Sigma$, respectively. For comparison, we eliminate the constraints on $\Theta$ and $\Sigma$, consider the case that $\Theta$ and $\Sigma$ are fixed, and let $L_i=f_i$. Denote the optimal solutions of problems (22) and (23) as $\mathrm{w}^{g_1}$ and $\mathrm{w}^{g_2}$, respectively. The optimal solutions satisfy the following conditions:

$$\nabla f(\mathrm{w}^{g_1})+\frac{1}{2}\lambda\big(I_T\otimes\Theta^{-1}+(I_T\otimes\Theta^{-1})^{T}\big)\mathrm{w}^{g_1}=0, \tag{24a}$$

$$\nabla f(\mathrm{w}^{g_2})+\frac{1}{2}\lambda\big(\Sigma^{-1}\otimes I_d+(\Sigma^{-1}\otimes I_d)^{T}\big)\mathrm{w}^{g_2}=0. \tag{24b}$$

Comparing the optimality conditions (21) and (24) for the Nash equilibrium problem (20) and the multi-task learning problems (22)-(23), we find that by setting

$$\sigma(I_{dT}-\bar{P})=\frac{1}{2}\lambda\big(I_T\otimes\Theta^{-1}+(I_T\otimes\Theta^{-1})^{T}\big)$$

or

$$\sigma(I_{dT}-\bar{P})=\frac{1}{2}\lambda\big(\Sigma^{-1}\otimes I_d+(\Sigma^{-1}\otimes I_d)^{T}\big),$$

the optimal solution $\mathrm{w}^{o}$ will be the same as $\mathrm{w}^{g_1}$ or $\mathrm{w}^{g_2}$. Thus, the general multi-task learning problem can be solved by the MGD algorithm by setting the coefficients between task $i$ and task $j$ properly. In addition, using MGD, we can consider both task-task relations and feature-feature relations at the same time. Furthermore, in MGD, $\bar{P}_{ij}$ is not required to equal $\bar{P}_{ji}$. This relaxation allows asymmetric task relations in multi-task learning, which cannot be achieved by most multi-task learning methods, as (24) shows.

### III-E Incorporating Second-Order Label Correlations

The transfer coefficients can be designed or learned by many different methods. In multi-label learning problems, the similarity between task $i$ and task $j$ can be modeled by the correlation between labels $\mathbf{y}_i$ and $\mathbf{y}_j$. In this paper, we use the cosine similarity to calculate the correlation matrix. The proposed MGD is summarized in Algorithm 1.
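As a rough illustration of this step, the cosine similarity between label columns can be computed as below. How Algorithm 1 turns the correlation matrix into coefficients satisfying (4a)-(4b) is not shown in this section, so the row normalization in `to_transfer_coefficients` is an assumption made only for this sketch, and both function names are hypothetical.

```python
import numpy as np

def cosine_label_correlation(Y):
    """Cosine similarity between the columns (labels) of an n x T label matrix."""
    Y = np.asarray(Y, dtype=float)
    norms = np.linalg.norm(Y, axis=0)
    norms[norms == 0] = 1.0          # guard labels with no positive example
    Yn = Y / norms
    S = Yn.T @ Yn
    np.fill_diagonal(S, 1.0)         # each label is fully similar to itself
    return S

def to_transfer_coefficients(S):
    """Assumed mapping (not from the paper): scale each row to sum to one."""
    return S / S.sum(axis=1, keepdims=True)

Y = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [1, 1, 0]])
S = cosine_label_correlation(Y)      # symmetric correlation matrix
M = to_transfer_coefficients(S)      # nonnegative, rows sum to one
```

Even though `S` is symmetric, the row-normalized `M` generally is not, which is consistent with the asymmetric transfer that MGD permits.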

After learning the model parameter , we can predict the label for a test instance by the corresponding prediction function associated with the cost function, and the final predicted label vector is .

### III-F Complexity Analysis

We mainly analyze the complexity of the iteration parts listed in Algorithm 1. In each iteration, the gradient calculation leads to a complexity of , where is the complexity of calculating the gradient w.r.t. the dimension , which is determined by the actual cost function, and the update of the model parameter according to (3) needs . Therefore, the overall complexity of the MGD algorithms is of order , where is the iteration times.

## IV Experiments

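In this section, we extensively compare the proposed MGD algorithm with related approaches on real-world datasets. For the proposed MGD algorithm, we reformulate the multi-label learning problem as a set of binary classification tasks. For each classification task, we use the cost function of $\ell_2$-norm regularized logistic regression. Thus, for any task $i$, the following objective function is optimized by the proposed algorithm,

$$\min_{\mathbf{w}_i}f_i(\mathbf{w}_i)=-\frac{1}{n}\sum_{j=1}^{n}\Big(y_{ij}\log h(z_{ij})+(1-y_{ij})\log\big(1-h(z_{ij})\big)\Big)+\frac{1}{2}\rho\|\mathbf{w}_{i,-1}\|^{2},$$

where $h(z)=1/(1+e^{-z})$, $z_{ij}=[1\ \ \mathbf{x}_j^{T}]\mathbf{w}_i$, $\mathbf{w}_i$ is the model parameter, $\mathbf{w}_{i,-1}$ is the remaining elements in $\mathbf{w}_i$ except the first element, and $\rho$ is the regularization parameter. The gradient of $f_i$ over $\mathbf{w}_i$ is

$$\nabla f_i=\frac{1}{n}\sum_{j=1}^{n}\big(h(z_{ij})-y_{ij}\big)[1\ \ \mathbf{x}_j^{T}]^{T}+\rho\begin{bmatrix}0\\ \mathbf{w}_{i,-1}\end{bmatrix}.$$

Let

$$X=\begin{bmatrix}1 & \mathbf{x}_1^{T}\\ \vdots & \vdots\\ 1 & \mathbf{x}_n^{T}\end{bmatrix},\qquad \mathbf{y}_i=\begin{bmatrix}y_{i1}\\ \vdots\\ y_{in}\end{bmatrix}.$$

The MGD iteration is

$$\mathbf{w}_i^{t+1}=\sum_{j=1}^{T}m_{ij}^{t}\mathbf{w}_j^{t}-\frac{\alpha}{n}X^{T}\big(g(X\mathbf{w}_i^{t})-\mathbf{y}_i\big)-\alpha\rho\begin{bmatrix}0\\ \mathbf{w}_{i,-1}^{t}\end{bmatrix},$$

where $g(X\mathbf{w}_i^{t})=\mathrm{col}\{h(z_{i1}^{t}),\dots,h(z_{in}^{t})\}$ applies $h$ to each element of $X\mathbf{w}_i^{t}$.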

The $i$-th label of an instance $\mathbf{x}$ is predicted as 1 if $h([1\ \ \mathbf{x}^{T}]\mathbf{w}_i)\geq\tau$ and 0 otherwise, where $\tau$ is the threshold. In the experiment, $\tau$ is chosen from .
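The logistic-regression instance of MGD can be sketched in vectorized form, with one column of `W` per task. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names, the synthetic data, and the default threshold `tau=0.5` are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W, X, Y, rho):
    """Per-task regularized logistic loss; returns a length-T vector."""
    P = sigmoid(X @ W)
    ce = -np.mean(Y * np.log(P) + (1 - Y) * np.log(1 - P), axis=0)
    return ce + 0.5 * rho * np.sum(W[1:] ** 2, axis=0)

def mgd_logistic_step(W, M, X, Y, alpha, rho):
    """One MGD iteration for all tasks.

    W: (d+1, T), column i is w_i^t.   X: (n, d+1) with a leading column of ones.
    Y: (n, T) binary labels.          M: (T, T) transfer coefficients m_ij^t.
    """
    n = X.shape[0]
    reg = rho * W
    reg[0, :] = 0.0                              # the intercept is not regularized
    grad = X.T @ (sigmoid(X @ W) - Y) / n + reg
    return W @ M.T - alpha * grad                # column i: sum_j m_ij w_j - alpha * grad_i

def predict(W, X, tau=0.5):
    """Label i is predicted 1 when the sigmoid output reaches the threshold tau."""
    return (sigmoid(X @ W) >= tau).astype(int)

rng = np.random.default_rng(0)
n, d, T = 60, 3, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
Y = (rng.uniform(size=(n, T)) < 0.5).astype(float)
W0 = np.zeros((d + 1, T))
W1 = mgd_logistic_step(W0, np.eye(T), X, Y, alpha=0.1, rho=0.01)
```

With the identity matrix as `M` the step is ordinary gradient descent on each task, so the summed loss should drop from `W0` to `W1`; replacing `M` with a row-stochastic matrix built from label correlations gives the transfer behaviour described above.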

### IV-A Experimental Setup

#### IV-A1 Datasets

We conduct the multi-label classification on six benchmark multi-label datasets, including regular-scale datasets: emotions, genbase, cal500, and enron; and large-scale datasets: corel5k and bibtex. The details of the datasets are summarized in Table I, where , , L, , and represent the number of examples, the number of features, the number of class labels, the average number of labels per example, and feature type of dataset , respectively. The datasets are downloaded from the website of Mulan [28].

#### IV-A2 Evaluation Metrics

Five widely used evaluation metrics are employed to evaluate the performance: Average precision, Macro-averaging F1, Micro-averaging F1, Coverage score, and Ranking loss. Concrete metric definitions can be found in [1]. Note that, for comparison purposes, the coverage score is normalized by the number of labels. For Average precision, Macro-averaging F1, and Micro-averaging F1, larger values indicate better performance. For the other two metrics, smaller values indicate better performance.
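To make the difference between the two F1 variants concrete, the sketch below computes both from binary prediction matrices: macro-averaging takes the mean of per-label F1 scores, while micro-averaging pools the counts over all labels first. The function name and the zero-denominator convention are assumptions for illustration; the definitions used in the experiments are those in [1].

```python
import numpy as np

def f1_scores(Y_true, Y_pred):
    """Return (macro_f1, micro_f1) for (n, T) binary arrays."""
    Y_true = np.asarray(Y_true).astype(bool)
    Y_pred = np.asarray(Y_pred).astype(bool)
    tp = np.sum(Y_true & Y_pred, axis=0).astype(float)
    fp = np.sum(~Y_true & Y_pred, axis=0).astype(float)
    fn = np.sum(Y_true & ~Y_pred, axis=0).astype(float)

    denom = 2 * tp + fp + fn
    # Convention assumed here: a label with no true and no predicted positives scores 0.
    per_label = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    macro = per_label.mean()

    total = denom.sum()
    micro = 2 * tp.sum() / total if total > 0 else 0.0
    return macro, micro

Y_true = np.array([[1, 0], [1, 1], [0, 1]])
Y_pred = np.array([[1, 1], [0, 1], [0, 0]])
macro_f1, micro_f1 = f1_scores(Y_true, Y_pred)
```

In this toy example the per-label F1 scores are 2/3 and 1/2, so the macro average is 7/12, while pooling the counts gives a micro F1 of 4/7.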

#### IV-A3 Comparing Algorithms

We compare our proposed method MGD with three classical algorithms including BR [2], RAkEL [6], ECC [4], and two state-of-the-art multi-label learning algorithms LIFT [14] and LLSF-DL [8].

In the experiments, we used the source codes provided by the authors for implementation. BR, ECC, and RAkEL are implemented under the Mulan multi-label learning package [28] using the logistic regression model as the base classifier. Parameters suggested in the corresponding literatures are used, i.e., RAkEL: ensemble size 2 with ; ECC: ensemble size 30; LIFT: the ratio parameter is tuned in {0.1,0.2,…,0.5}; LLSF-DL: ,