Deep Ranking Based Cost-sensitive Multi-label Learning for Distant Supervision Relation Extraction
Abstract
A knowledge base offers a potential way to improve the intelligence of information retrieval (IR) systems, because it contains numerous relations between entities that help IR systems infer from one entity to another. Relation extraction is one of the fundamental techniques for constructing a knowledge base. Distant supervision is a semi-supervised learning method for relation extraction that learns from labeled and unlabeled data. However, this approach suffers from the problem of relation overlapping, in which one entity tuple may have multiple relation facts. We believe that relation types can have latent connections, which we call class ties, and that these can be exploited to enhance relation extraction. However, this property of relation classes has not been fully explored before. In this paper, to exploit class ties between relations, we propose a general ranking based multi-label learning framework combined with convolutional neural networks, in which ranking based loss functions with a regularization technique are introduced to learn the latent connections between relations. Furthermore, to deal with class imbalance in distant supervision relation extraction, we adopt cost-sensitive learning to rescale the costs from the positive and negative labels. Extensive experiments on a widely used dataset show the effectiveness of our model in exploiting class ties and relieving the class imbalance problem.
keywords:
Distant supervision, relation extraction, class ties, class imbalance, multi-label learning, cost-sensitive learning, deep ranking
1 Introduction
Relation extraction (RE) aims to classify the relations (also called relation facts) between two given named entities from natural-language text. Fig. 1 shows two sentences with the same entity tuple but two different relation facts. The goal of RE is to accurately extract the corresponding relation facts (place_of_birth, place_lived) for the entity tuple (Patsy Ramsey, Atlanta) based on the sentence contexts. Supervised-learning methods require large amounts of labeled data to work well. With the rapid growth in the number of relation types, traditional methods cannot keep up because labeled data are limited. To address this data sparsity, 2009 () proposes distant supervision (DS) for relation extraction, which automatically generates training data by aligning a knowledge base of facts (e.g., Freebase freebase ()) to text. For a fact (i.e., an entity tuple with a relation type) from the knowledge base, the sentences containing the entity tuple in the fact are regarded as training data.
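The alignment step of distant supervision can be illustrated with a minimal sketch. The function name `distant_supervision`, the naive substring matching and the toy sentences are assumptions made for illustration only; a real pipeline would rely on proper entity linking.

```python
from collections import defaultdict

def distant_supervision(facts, sentences):
    """Label every sentence that mentions both entities of a KB fact.

    facts: iterable of (head, tail, relation) triples.
    sentences: iterable of raw sentence strings.
    Returns {(head, tail): {"relations": set, "mentions": list}}.
    """
    bags = defaultdict(lambda: {"relations": set(), "mentions": []})
    for head, tail, relation in facts:
        bag = bags[(head, tail)]
        bag["relations"].add(relation)
        for sent in sentences:
            # Naive substring match stands in for real entity linking.
            if head in sent and tail in sent and sent not in bag["mentions"]:
                bag["mentions"].append(sent)
    return dict(bags)

facts = [
    ("Patsy Ramsey", "Atlanta", "place_of_birth"),
    ("Patsy Ramsey", "Atlanta", "place_lived"),
]
# Toy sentences invented for this sketch.
sentences = [
    "Patsy Ramsey was born in Atlanta.",
    "Patsy Ramsey spent her childhood in Atlanta.",
    "Atlanta hosted the 1996 Olympics.",
]
bags = distant_supervision(facts, sentences)
```

Note that both matching sentences end up in the same bag and inherit both relation labels, which is where the overlapping and wrong labelling problems discussed in this paper come from.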
Class ties are the connections (relatedness) between relation types in relation extraction. In general, class ties fall into two categories: weak class ties and strong class ties. Weak class ties mainly involve the co-occurrence of relations, such as place_of_birth and place_lived, or CEO_of and founder_of. Strong class ties mean that relations have latent logical entailments. Take the two relations capital_of and city_of for example: if one entity tuple has the relation capital_of, it must also express the relation fact city_of, because capital_of entails city_of. Obviously the converse does not hold. Further take the following sentence
Jonbenet told me that her mother never left since she was born.
for example. This sentence expresses two relation facts, place_of_birth and place_lived. However, the word “born” strongly biases the model towards place_of_birth, so the relation place_lived may not be easy to predict. Extracting place_of_birth nevertheless provides evidence for predicting place_lived once the weak ties between the two relations are incorporated.
Exploiting class ties is necessary for DS based relation extraction. In the DS scenario, one entity tuple can have multiple relation facts, a challenge called relation overlapping 2011 (); 2012 (), as shown in Fig. 1. However, the relations of one entity tuple can have the class ties mentioned above, which can be leveraged to enhance relation extraction because they narrow down the potential search space and reduce uncertainty between relations when predicting unknown relations. For example, if one pair of entities has the CEO_of relation, it will have the founder_of relation with high probability.
To exploit class ties between relations, we propose joint extraction that considers pairwise connections between positive and negative labels, inspired by furnkranz2008multilabel (); zhang2006multilabel (). Consider the example in Fig. 1 of one entity tuple with two different relation types: by extracting the two relations jointly, we maintain their class ties (co-occurrence), which a model can learn and then leverage to extract instances with unknown relations. We introduce a ranking based multi-label learning framework for joint extraction, which learns to rank the prediction probability of positive relations higher than that of negative ones, and we design ranking based loss functions for multi-label learning. Furthermore, inspired by zhouMIML (); MIML2005 (), we add a regularization term to the loss functions to better learn the relatedness between relation facts. We regularize only the positive relation types and ignore NR (no relation), based on the assumption that connections exist only among positive relations and not with NR (see Sec. 3.4).
Besides, class imbalance is another severe problem that cannot be ignored in distant supervision relation extraction. About 72.5% of the training data express the NR relation type, and the proportion is even higher in the test set at about 96.3% (see Table 1), so samples of NR type account for a much higher proportion than positive samples (those not categorized as NR). This problem severely affects model training, making the model prone to classifying samples as NR japkowicz2002class (). To overcome this problem, building on the ranking loss functions, we further adopt cost-sensitive learning to rescale the costs from the positive and negative labels, by increasing the losses for positive labels and penalizing the losses from the NR type (detailed in Sec. 3.5).
Furthermore, combining information across sentences will be more appropriate for joint extraction which provides more information from other sentences to extract each relation (hao (); lin2016 ()). In Fig. 1, sentence #1 is the evidence for place_of_birth, but it also expresses the meaning of “living in someplace”, so it can be aggregated with sentence #2 to extract place_lived. Meanwhile, the word of “hometown” in sentence #2 can provide evidence for place_of_birth which should be combined with sentence #1 to extract place_of_birth.
In this work, we propose a unified model that integrates ranking based cost-sensitive multi-label learning with a convolutional neural network (CNN) to exploit class ties between relations and further relieve the class imbalance problem. Inspired by the effectiveness of deep learning for modeling sentence features deeplearning2015 (), we use a CNN to encode sentences. Similar to lin2016 (); rank2015 (), we use class embeddings to represent relation classes. The whole model architecture is presented in Fig. 2. We first use the CNN to embed sentences; we then introduce two variant methods that combine the embedded sentences into one bag representation vector, aiming to aggregate information across sentences; after that, we measure the similarity between the bag representation and each relation class in real-valued space. Finally, we use the ranking loss functions to learn joint extraction over multiple relation types.
Our experimental results on the dataset of 2010 () show that: (1) our model is much more effective than the baselines; (2) leveraging class ties enhances relation extraction, and our model learns class ties effectively through joint extraction; (3) a much better model can be trained after relieving the class imbalance caused by NR.
Our contributions in this paper can be encapsulated as follows:
We propose to leverage class ties to enhance relation extraction. Combined with CNN, an effective deep ranking based multilabel learning model with regularization technique is introduced to exploit class ties.
We adopt cost-sensitive learning to relieve the class imbalance problem, and experimental results show the effectiveness of our method.
2 Related Work
2.1 Relation Extraction
Previous methods for relation extraction can mainly be divided into supervision based and distant supervision based. Supervision based methods need much labeled data to work well and thus cannot keep up with the rapid growth of relation types. To overcome this data sparsity, distant supervision relation extraction was proposed by 2009 (). However, DS based relation extraction suffers from two problems, wrong labelling and overlapping: the former means that sentences containing certain entities do not actually express the indicated relation type, or do not express any relation at all, and the latter means that one entity tuple may have multiple relation types. To solve the wrong labelling problem, 2010 () introduces multi-instance learning for relation extraction, in which the mentions of one entity tuple are merged into one bag and the model extracts relations from mention bags; however, this method cannot deal with the relation overlapping problem. Afterwards, 2011 () and 2012 () introduce the framework of multi-instance multi-label learning to jointly overcome the two problems and improve performance significantly. Though they also propose joint extraction of relations, they use information from a single sentence only, losing information from other sentences. Global () tries to use a Markov logic model to capture consistency between relation labels; in contrast, our model leverages deep ranking to learn class ties automatically.
In recent years, deep learning has achieved remarkable success in computer vision and natural language processing deeplearning2015 (). Deep learning has been applied to automatically learn the features of sentences (zeng2014 (); Yu2014 (); rank2015 (); lin2016 (); DBLP:conf/pakdd/YeYLC17 (); DBLP:conf/emnlp/YeW18 (); DBLP:journals/corr/abs180208504 (); DBLP:conf/coling/JiangYLCM18 ()). In supervised relation extraction, zeng2014 () applies convolutional neural networks to model sentences and incorporates position features for RE, which yields significant gains in RE performance. Afterwards, Yu2014 (); rank2015 (); lin2016 () introduce more advanced deep learning models for RE. In distant supervision relation extraction, zeng2015 () proposes a piecewise convolutional neural network with multi-instance learning for DS based relation extraction, which improves precision and recall significantly. Afterwards, lin2016 () introduces the attention mechanism (attention1 (); attention2 ()) to merge sentence features, aiming to construct better bag representations. lin2017 () further proposes a multilingual neural relation extraction framework that considers the information consistency and complementarity among cross-lingual texts. However, these deep learning based models only perform separated extraction and thus cannot model class ties between relations. Recently, zeng2016incorporating () proposes to incorporate relation paths for distant supervision relation extraction, and ji2017distant () uses the descriptions of entities to enhance distant supervision relation extraction. chen2018encoding () proposes a joint inference approach that encodes implicit relation requirements for relation extraction. Joint learning is also applied to jointly study two related tasks DBLP:journals/corr/abs190600575 (). Besides, many works have recently been proposed to solve the wrong labelling problem.
bingfeng2017 () proposes to model the noise caused by the wrong labelling problem and shows that a dynamic transition matrix can effectively characterize the noise. qin2018dsgan (); han2018denoising () propose to use adversarial learning goodfellow2014generative () to solve the wrong labelling problem. Instead, DBLP:conf/aaai/FengHZYZ18 (); DBLP:conf/acl/WangXQ18a () adopt reinforcement learning to learn to select high-quality data for training. DBLP:conf/emnlp/LiuWCS17 () dynamically corrects wrongly labeled data during training by exploiting semantic information from labeled entity pairs. DBLP:conf/emnlp/LiuZZJ18 () transfers prior knowledge learned from a relevant entity classification task to make the model robust to noisy data.
2.2 Deep Learning to Rank
Learning to rank (LTR) is an important technique in information retrieval (IR) liu2009 (). The methods for training an LTR model include pointwise, pairwise and listwise approaches. We apply pairwise LTR in our paper. Deep learning to rank has been widely used as a classification model in many problems. In image retrieval, zhao2015deep () applies deep semantic ranking to multi-label image retrieval. In text matching, severyn2015learning () adopts learning to rank combined with a deep CNN for matching short text pairs. In traditional supervised relation extraction, rank2015 () designs a pairwise loss function based on CNN for single-label relation extraction. Building on the advantages of deep learning to rank, we combine pairwise LTR liu2009 () with CNN in our model, aiming to jointly extract multiple relations.
2.3 Cost-sensitive Learning
Cost-sensitive learning is one of the techniques for the class imbalance problem; it assigns higher misclassification costs to classes with a small proportion. For example, shen2015deepcontour () proposes a regularized softmax to deal with imbalanced edge label classification. khan2015cost () adopts cost-sensitive learning to learn deep feature representations from imbalanced data. Another approach to relieving class imbalance is resampling huang2016learning (); imbalance (), including oversampling and undersampling, which aims to balance the distribution of data across labels.
This paper is an extension of yehaiacl2017 (). Compared to the original work in yehaiacl2017 (), this paper has several improvements:
Methods: (a) We fully consider the class imbalance problem and propose a novel ranking based cost-sensitive loss function combined with multi-label learning. (b) To better learn class ties between relations, we introduce a regularization term into the ranking loss functions.
Experiments: (a) We conduct further experiments to analyze the effectiveness of our novel cost-sensitive ranking loss functions. (b) Evaluation experiments on the effectiveness of regularization have also been conducted.
Content: (a) We rewrite the description of our methods from the viewpoint of multi-label learning and cost-sensitive learning to provide better theoretical justification.
3 Methodology
We introduce our methods in this section. First, we describe the widely used CNN architecture for sentence encoding. Then we discuss the ranking based multi-label learning framework with the regularization technique. After that, we introduce the proposed cost-sensitive learning, which overcomes the effects of NR on model training.
3.1 Notation
We define the relation classes as , entity tuples as and mentions as . (A sentence containing a given entity is called a mention.) The dataset is constructed as follows: for entity tuple and its relation class set , we collect all the mentions that contain ; the dataset we use is . Given a data item , the sentence embeddings of encoded by the CNN are defined as , and we use class embeddings to represent the relation classes, which are learned during model training.
3.2 CNN for sentence embedding
We take the effective piecewise CNN architecture adopted from zeng2015 (); lin2016 () to encode sentence and we will briefly introduce PCNN in this section. More details of PCNN can be obtained from previous work.
3.2.1 Words Representations
Word Embedding Given a word embedding matrix where is the size of word dictionary and is the dimension of word embedding, the words of a mention will be represented by real-valued vectors from .
Position Embedding The position embedding of a word measures the distance from the word to entities in a mention. We add position embeddings into words representations by appending position embedding to word embedding for every word. Given a position embedding matrix where is the number of distances and is the dimension of position embeddings, the dimension of words representations becomes .
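A minimal sketch of how position features might be computed and appended to word embeddings is given here. The clipping bound `max_distance`, the helper names, and the choice of one position embedding per entity (two in total) are assumptions of this sketch rather than details stated in the text.

```python
def relative_positions(num_words, entity_index, max_distance):
    """Signed distance from each word to an entity, clipped to +/- max_distance."""
    return [max(-max_distance, min(max_distance, i - entity_index))
            for i in range(num_words)]

def word_representation(word_vec, pos_vec_head, pos_vec_tail):
    """Append the two position embeddings to the word embedding."""
    return list(word_vec) + list(pos_vec_head) + list(pos_vec_tail)

tokens = ["Patsy", "was", "born", "in", "Atlanta"]
head_dist = relative_positions(len(tokens), 0, max_distance=3)
tail_dist = relative_positions(len(tokens), 4, max_distance=3)

# Each clipped distance would index a row of the position embedding matrix;
# here two 1-dimensional toy position vectors are appended to a 2-dimensional word vector.
rep = word_representation([0.1, 0.2], [1.0], [2.0])
```

Under these assumptions the representation of a word has the word embedding dimension plus twice the position embedding dimension.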
3.2.2 Convolution, Piecewise Max-pooling
After transforming words in to real-valued vectors, we get the sentence . The set of kernels is where is the number of kernels. Define the window size as and given one kernel , the convolution operation is defined as follows:
(1) 
where is the vector after conducting convolution along for times and is the bias vector. Vectors whose indexes fall outside the range of are replaced with zero vectors.
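The convolution with zero padding can be sketched as follows. This is only an illustration: the centred window, the flat kernel layout and the scalar per-kernel bias are assumptions of the sketch, and the function name `convolve` is hypothetical.

```python
def convolve(sentence, kernel, bias, window):
    """Slide one kernel over a sentence of word vectors.

    sentence: list of n word vectors, each of length d.
    kernel: flat list of length window * d.
    Out-of-range positions are treated as zero vectors.
    """
    n, d = len(sentence), len(sentence[0])
    zero = [0.0] * d
    half = window // 2
    out = []
    for j in range(n):
        flat = []
        for k in range(j - half, j - half + window):
            flat.extend(sentence[k] if 0 <= k < n else zero)
        out.append(sum(a * b for a, b in zip(flat, kernel)) + bias)
    return out

sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
feature_map = convolve(sentence, kernel=[1.0] * 6, bias=0.0, window=3)
```

With an all-ones kernel the output at each position is simply the sum of the entries inside the padded window, which makes the padding behaviour easy to check by hand.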
In piecewise max-pooling, the sentence is divided into three parts when pooling: , and ( and are the positions of entities, is the beginning of sentence and is the end of sentence). This piecewise max-pooling is defined as follows:
(2) 
where is the result of mention processed by kernel ; . Given the set of kernels , following the above steps, the mention can be embedded to where .
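As an illustration of piecewise max-pooling, the sketch below pools each kernel's feature map over three segments delimited by the two entity positions and concatenates the results. The exact segment boundaries (inclusive of the entity positions) and the 0.0 value for an empty segment are assumptions of this sketch.

```python
def piecewise_max_pool(feature_map, head_pos, tail_pos):
    """Max-pool one kernel's feature map over three segments.

    Assumed segments: [0, first], [first + 1, second], [second + 1, end],
    where first/second are the sorted entity positions.
    Empty segments yield 0.0.
    """
    first, second = sorted((head_pos, tail_pos))
    segments = [
        feature_map[: first + 1],
        feature_map[first + 1 : second + 1],
        feature_map[second + 1 :],
    ]
    return [max(seg) if seg else 0.0 for seg in segments]

def embed_mention(feature_maps, head_pos, tail_pos):
    """Concatenate the pooled triples of all kernels into one vector."""
    pooled = []
    for fmap in feature_maps:
        pooled.extend(piecewise_max_pool(fmap, head_pos, tail_pos))
    return pooled

maps = [[0.2, 0.9, 0.1, 0.4, 0.7], [0.5, 0.3, 0.8, 0.6, 0.0]]
vector = embed_mention(maps, head_pos=1, tail_pos=3)
```

The resulting vector has three entries per kernel, so its length grows linearly with the number of kernels.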
3.2.3 Non-linear Layer, Regularization
To learn high-level features of mentions, we apply a non-linear layer after the pooling layer. After that, a dropout layer is applied to prevent over-fitting. We define the final fixed sentence representation as ().
(3) 
where is a non-linear function and we use in this paper; is a Bernoulli random vector with probability p to be .
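The non-linear layer followed by dropout might look like the sketch below. The specific non-linearity is not recoverable from the text here, so `tanh` is used purely as an assumption; the mask is assumed to keep a unit with probability `keep_prob`, and the test-time rescaling is a common convention rather than something stated in this paper.

```python
import math
import random

def sentence_representation(pooled, keep_prob, rng, train=True):
    """Apply tanh (assumed), then mask each unit with a Bernoulli(keep_prob) draw."""
    activated = [math.tanh(x) for x in pooled]
    if not train:
        # Common convention: scale by keep_prob at test time instead of masking.
        return [keep_prob * a for a in activated]
    mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in activated]
    return [a * m for a, m in zip(activated, mask)]

rng = random.Random(0)
s = sentence_representation([0.9, -0.4, 0.7], keep_prob=0.5, rng=rng)
```

Each entry of `s` is either zero (dropped) or the activated value of the corresponding pooled feature.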
3.3 Combine Information across Sentences
We propose two options to combine sentences to provide enough information for multilabel learning.
AVE The first option is the average method. This method treats all the sentences equally and directly averages the values in every dimension of the sentence embeddings. The AVE function is defined as follows:
(4) 
where is the number of sentences and is the bag representation combining all sentence embeddings. Because it weights all sentences equally, this method may introduce much noise from two sources: (1) wrongly labelled data; (2) mentions unrelated to a given relation class, because all sentences containing the same entity tuple are combined to construct the bag representation.
ATT The second option is the sentence-level attention algorithm used by lin2016 (), which measures the importance of sentences in order to relieve the wrong labelling problem. For every sentence, ATT calculates a weight by comparing the sentence to one relation. We first calculate the similarity between one sentence embedding and relation class as follows:
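The AVE option amounts to a dimension-wise mean over the sentence embeddings of a bag, as in this small sketch (the function name `average_bag` is hypothetical):

```python
def average_bag(sentence_embeddings):
    """Dimension-wise mean of all sentence embeddings in a bag."""
    n = len(sentence_embeddings)
    dim = len(sentence_embeddings[0])
    return [sum(s[d] for s in sentence_embeddings) / n for d in range(dim)]

bag = average_bag([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])
```

Every sentence contributes with weight 1/n, which is why noisy or unrelated mentions affect the bag representation as much as informative ones.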
(5) 
where is the similarity between sentence embedding and relation class and a is a bias factor. In this paper, we set as . Then we apply to rescale () to . We get the weight for as follows:
(6) 
so the function to merge with ATT is as follows:
(7) 
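The ATT option can be sketched as a similarity score per sentence, a normalization over the bag, and a weighted sum. Two points are assumptions of this sketch: the factor a is treated as a multiplicative scale on the dot product, and the rescaling function is taken to be a softmax.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_bag(sentence_embeddings, class_embedding, scale=1.0):
    """Merge sentences into a bag vector weighted by similarity to one class."""
    sims = [scale * dot(s, class_embedding) for s in sentence_embeddings]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(e - peak) for e in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(sentence_embeddings[0])
    bag = [sum(w * s[d] for w, s in zip(weights, sentence_embeddings))
           for d in range(dim)]
    return bag, weights

bag, weights = attention_bag([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.0])
```

In this toy case the first sentence is more similar to the class embedding and therefore receives the larger weight, so the bag vector is pulled towards it.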
3.4 Learning Class Ties via Ranking based Multilabel Learning with Regularization
Firstly, we have to present the score function to measure the similarity between bag representation and relation .
Score Function We use dot function to produce score for to be predicted as relation . The score function is as follows:
(8) 
There are other options for score function. In MultiATT (), they propose a margin based loss function that measures the similarity between and by distance. Because score function is not an important issue in our model, we adopt dot function, also used by rank2015 () and lin2016 (), as our score function.
Now we start to introduce the ranking loss functions.
Pairwise ranking aims to learn the score function that ranks positive classes higher than negative ones. This goal can be summarized as follows:
(9) 
where is a margin factor which controls the minimum margin between the positive scores and negative scores. Inspired by rank2015 (), given and , we adopt the following function to learn the score function:
(10) 
where , is the rescale factor, is positive margin and is negative margin. This loss function is designed to rank positive classes higher than negative ones controlled by the margin of . In reality, will be higher than and will be lower than . In our work, we set as , as and as adopted from rank2015 (). To simplify the loss functions given below, we use to replace the first term in and use to replace the second term.
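A pairwise ranking loss with a rescale factor and separate positive and negative margins might be sketched as below. The softplus form of the two terms follows the pairwise loss commonly associated with the cited ranking CNN work and is an assumption here, as are the default values of `gamma`, `m_pos` and `m_neg`, which are illustrative only.

```python
import math

def positive_term(score, gamma, m_pos):
    """Penalizes a positive class whose score falls below the positive margin."""
    return math.log1p(math.exp(gamma * (m_pos - score)))

def negative_term(score, gamma, m_neg):
    """Penalizes a negative class whose score rises above minus the negative margin."""
    return math.log1p(math.exp(gamma * (m_neg + score)))

def pairwise_loss(pos_score, neg_score, gamma=2.0, m_pos=2.5, m_neg=0.5):
    return (positive_term(pos_score, gamma, m_pos)
            + negative_term(neg_score, gamma, m_neg))

well_separated = pairwise_loss(pos_score=5.0, neg_score=-5.0)
undecided = pairwise_loss(pos_score=0.0, neg_score=0.0)
```

The loss approaches zero once the positive score clears its margin and the negative score is pushed well below zero, and it grows roughly linearly when the ordering is violated.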
To model the class ties (co-occurrence) of the labels, we assume that the positive labels share the same class ties and are connected with each other. Under this assumption, we have two mechanisms for learning the class ties: joint extraction of relations, and explicit modeling of the connections by regularizing the learning of positive labels. Below, we first introduce the loss functions for multi-label learning extended from Eq. 10; we then discuss the regularization term.
To learn class ties between relations, we first extend Eq. 10 to multi-label learning. The proposed ranking based loss functions are as follows:
with AVE (Variant1) We define the margin-based loss function with the AVE option for aggregating sentences as follows:
(11) 
Similar to Weston2011 () and rank2015 (), we update one negative class at every training round, but to balance the loss between positive classes and negative ones, we multiply before the right term in Eq. 11 to expand the negative loss. We apply mini-batch based stochastic gradient descent (SGD) to minimize the loss function. The negative class is chosen as the one with the highest score among all negative classes rank2015 (), i.e.:
(12) 
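The Variant1 procedure described above (score every class with a dot product, pick the highest-scoring negative class, sum the positive terms and expand the single negative term) can be sketched as follows. The softplus form of the terms, the default margins, and the use of the number of positive labels as the multiplier of the negative term are assumptions of this sketch.

```python
import math

def softplus(x):
    return math.log1p(math.exp(x))

def scores(bag, class_embeddings):
    """Dot-product score of the bag vector against every class embedding."""
    return [sum(b * w for b, w in zip(bag, row)) for row in class_embeddings]

def variant1_loss(bag, class_embeddings, positives,
                  gamma=2.0, m_pos=2.5, m_neg=0.5):
    f = scores(bag, class_embeddings)
    negatives = [c for c in range(len(f)) if c not in positives]
    # Only the highest-scoring negative class is updated in this round.
    hardest = max(negatives, key=lambda c: f[c])
    pos_loss = sum(softplus(gamma * (m_pos - f[c])) for c in positives)
    # Assumed multiplier: the number of positive labels.
    neg_loss = len(positives) * softplus(gamma * (m_neg + f[hardest]))
    return pos_loss + neg_loss, hardest

W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
loss, hardest = variant1_loss([2.0, 1.0], W, positives={0, 1})
```

Here class 2 has the highest score among the negative classes, so it is the one selected for the negative term.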
with ATT (Variant2) Now we define the loss function for the option of ATT to combine sentences as follows:
(13) 
where means the attention weighted representation where attention weights are merged by comparing sentence embeddings with relation class and is chosen by the following function:
(14) 
which means we update one negative class in every training round. We keep the values of , and same as values in Eq. 11. In Eq. 13, for every , we need to sample according to Eq. 14, so different from Eq. 11, we do not extend the negative loss by multiplying .
According to this loss function, we can see that: for each class , it will capture the most related information from sentences to merge , then rank higher than all negative scores which each is (). We use the same update algorithm to minimize this loss.
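For Variant2, each positive class builds its own attention-weighted bag vector and is ranked against the highest-scoring negative class under that vector. The sketch below illustrates this under several assumptions: softmax attention weights, softplus loss terms with illustrative margins, and negative scores computed with the same class-specific bag vector.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softplus(x):
    return math.log1p(math.exp(x))

def attention_bag(sentences, class_vec):
    sims = [dot(s, class_vec) for s in sentences]
    peak = max(sims)
    exps = [math.exp(e - peak) for e in sims]
    total = sum(exps)
    return [sum(e / total * s[d] for e, s in zip(exps, sentences))
            for d in range(len(sentences[0]))]

def variant2_loss(sentences, class_embeddings, positives,
                  gamma=2.0, m_pos=2.5, m_neg=0.5):
    negatives = [c for c in range(len(class_embeddings)) if c not in positives]
    total = 0.0
    for c in positives:
        bag_c = attention_bag(sentences, class_embeddings[c])
        # One negative class is sampled per positive class.
        hardest = max(negatives, key=lambda k: dot(bag_c, class_embeddings[k]))
        total += softplus(gamma * (m_pos - dot(bag_c, class_embeddings[c])))
        total += softplus(gamma * (m_neg + dot(bag_c, class_embeddings[hardest])))
    return total

sentences = [[3.0, 0.0], [0.0, 3.0]]
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
good = variant2_loss(sentences, W, positives=[0, 1])
bad = variant2_loss(sentences, W, positives=[2])
```

Labelling the bag with the two classes that match its sentences gives a much smaller loss than labelling it with the class that matches neither.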
Based on the assumption that all positive labels share the same class ties, joint extraction of the relations can capture the co-occurrence of the labels. If the relations for the same entity pair usually appear together, then extracting them jointly lets the model learn the statistics of their co-appearance.
Regularization To learn the class ties between relations, we have proposed the ranking based loss functions above. Inspired by zhouMIML (); MIML2005 (), we further capture the relation connections by adding an extra regularization term to the loss functions. We only consider the relatedness between positive labels ignoring NR. The relatedness is measured by the mean function :
(15) 
where . is the center of the labels, and we hope the positive labels can be close to the center which can be measured by:
(16) 
Following zhouMIML (), to model the class ties we need to minimize the loss function as follows:
(17) 
where and are hyperparameters. Eq. 17 is designed based on the consideration that the labels in which class ties exist should be clustered together and should be close to the center of these labels. According to Eq. 15, Eq. 16 can be rewritten as:
(18) 
By merging Eq. 18 into Eq. 17, we have our final regularization term:
(19) 
In this paper, we set as and is set as .
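The idea of pulling the positive class embeddings towards their mean can be illustrated with the sketch below. It covers only the distance-to-center part of the regularizer; the single coefficient `weight` and the squared Euclidean distance are assumptions of this sketch, not the exact form of Eq. 19.

```python
def center(vectors):
    """Mean of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def class_tie_penalty(class_embeddings, positives, weight=1.0):
    """Sum of squared distances from each positive class embedding to their mean."""
    pos = [class_embeddings[c] for c in positives]
    mean = center(pos)
    spread = sum(sum((x - m) ** 2 for x, m in zip(v, mean)) for v in pos)
    return weight * spread

W = [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]]
penalty = class_tie_penalty(W, positives=[0, 1])
```

Class 2 plays the role of NR or another negative class here: it is ignored, so only the spread between the two positive embeddings is penalized.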
Table 1: Proportion (%) of NR samples in the training and test sets.
Dataset  Training  Test
Riedel  72.52  96.26
3.5 Ranking based Cost-sensitive Multi-label Learning
In relation extraction, the dataset always contains negative samples that do not express any relation type and are classified as the NR type (no relation). Table 1 presents the proportion of NR samples in the dataset from 2010 (), which shows that most of the data are NR. Data imbalance severely affects model training and makes the model sensitive only to classes with a high proportion imbalance (), causing positive samples to be classified as NR. To relieve this problem, we adopt cost-sensitive learning to construct the loss function. Based on , the cost-sensitive loss function, which is Variant3, is as follows:
(20) 
where ; is an indicator function. Similar to Eq. 14, we select as follows:
(21) 
Because NR accounts for a high proportion of the training set, without any control the model will receive large costs from NR. To relieve the effects of NR, we penalize the losses from NR. Specifically, we have two strategies to do that. We adopt two hyperparameters which are () and to penalize the losses from NR. If is a positive label, to balance the costs between the positive labels and the NR label, we further add the costs from the remaining positive relations and at the same time, the extra cost from NR is calculated. The default value of is and if is small enough, this loss function will be similar to loss Eq. 13. Based on the experimental results, we find that the best results are achieved when is set to , so we set as in this paper. How the and affect model performance is discussed in Sec. 4.5 and Sec. 4.6. We also add the regularization term to to better capture the class ties between relations.
We give out the pseudocode of merging in algorithm .
4 Experiments
In this section, we conduct two sets of experiments: the first compares our method with the baselines, and the second evaluates our model. Unless otherwise stated, we adhere to the methods and settings described above in the following experiments.
4.1 Dataset and Evaluation Criteria
Dataset. We conduct our experiments on a widely used dataset, which was developed by 2010 () and has been used by 2011 (); 2012 (); zeng2015 (); lin2016 (). The dataset aligns Freebase relation facts with the New York Times corpus, in which training mentions are from the 2005-2006 corpus and test mentions from 2007. The training set contains 522,611 sentences, 281,270 entity pairs and 18,252 relation facts. In the test set, there are 172,448 sentences, 96,678 entity pairs and 1,950 relation facts. In all, there are 53 relation labels including the NR relation. Following 2009 (), we adopt the held-out evaluation framework in all experiments. We use the whole training dataset to train our model and then test the trained model on the test dataset, comparing the predicted relations to the gold relations.
Evaluation Criteria. To evaluate model performance, we draw precision/recall (P/R) curves and report precision@N (P@N). For the P/R curve, the larger the area under the curve, the better the model performance.
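Both metrics can be computed from predictions sorted by confidence, as in the following minimal sketch. The input format (a list of score and correctness pairs) and the function names are assumptions for illustration.

```python
def precision_at_n(ranked_hits, n):
    """Fraction of correct predictions among the top n."""
    top = ranked_hits[:n]
    return sum(top) / len(top)

def precision_recall_curve(ranked_hits, total_gold):
    """One (precision, recall) point per cut-off in the ranked list."""
    points, correct = [], 0
    for k, hit in enumerate(ranked_hits, start=1):
        correct += hit
        points.append((correct / k, correct / total_gold))
    return points

# (confidence, is_correct) pairs for predicted relation facts.
predictions = [(0.9, True), (0.8, True), (0.6, False), (0.4, True)]
ranked_hits = [int(hit) for _, hit in sorted(predictions, key=lambda p: -p[0])]
curve = precision_recall_curve(ranked_hits, total_gold=4)
```

In held-out evaluation a prediction counts as correct when the predicted relation fact appears in the knowledge base, so `total_gold` would be the number of gold relation facts in the test set.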
4.2 Experimental Settings
Word Embeddings. We adopt the trained word embeddings from lin2016 (). Similar to lin2016 (), we keep the words that appear more than times to construct word dictionary and use “UNK” to represent the other ones.
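The vocabulary construction with a frequency threshold and an “UNK” token can be sketched as follows. The threshold value is elided in the text, so `min_count` is left as a parameter, and the function names are hypothetical.

```python
from collections import Counter

def build_vocab(tokenized_sentences, min_count):
    """Keep words that appear more than min_count times; index 0 is UNK."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    vocab = {"UNK": 0}
    for tok, freq in counts.items():
        if freq > min_count and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Map tokens to indices, sending out-of-vocabulary words to UNK."""
    return [vocab.get(tok, vocab["UNK"]) for tok in sentence]

corpus = [["a", "b"], ["a", "c"], ["a", "b"]]
vocab = build_vocab(corpus, min_count=1)
ids = encode(["a", "c", "b"], vocab)
```

In this toy corpus the word "c" occurs only once, so it falls below the threshold and is mapped to the UNK index.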
Hyperparameter Settings. Three-fold validation on the training dataset is adopted to tune the parameters following 2012 (). We select word embedding size from . Batch size is tuned from . We determine learning rate among . The window size of convolution is tuned from . We keep other hyperparameters same as zeng2015 (): the number of kernels is , position embedding size is and dropout rate is . Table 2 shows the detailed parameter settings.
Parameter Name  Symbol  Value 
Window size  
Sentence. emb. dim.  
Word. emb. dim.  
Position. emb. dim.  
Batch size  
Learning rate  
Dropout pos. 
4.3 Comparisons with Baselines
Baseline. We compare our model with the following baselines:
Mintz 2009 () is the original model that incorporates distant supervision for relation extraction.
MultiR 2011 () is a multi-instance learning based graphical model that aims to address the overlapping relation problem.
MIML 2012 () is a multi-instance multi-label framework that jointly considers the wrong labelling problem and the overlapping problem.
PCNN+ATT lin2016 () is the previous state-of-the-art model on the dataset of 2010 (), which applies sentence-level attention to relieve the wrong labelling problem in DS based relation extraction. This model applies the piecewise convolutional neural network zeng2015 () to model sentences.
Besides comparing to the above methods, we also compare our variant models, denoted Rank+AVE (using the Variant1 loss, Eq. 11), Rank+ATT (using the Variant2 loss, Eq. 13) and Rank+Cost (using the Variant3 loss, Eq. 20).
Results and Discussion. We compare our three variants of loss functions with the baselines and the results are shown in Fig. 3. From the results we can see that:

Rank+AVE (Variant1) lags behind PCNN+ATT, probably because Rank+AVE does not use the attention mechanism to aggregate information across sentences, which introduces much noise when encoding sentence contexts;

After adopting the attention mechanism, Rank+ATT performs much better than Rank+AVE, and even better than PCNN+ATT;

Comparing PCNN+ATT and Rank+ATT, we can see that Rank+ATT is superior to PCNN+ATT, which results from modeling class ties in relation extraction;

Our variant method Rank+Cost achieves the best performance among all compared methods; compared with Rank+ATT, the gain shows that our cost-sensitive learning method does relieve the negative effects of NR.
4.4 Impact of Class Ties
In this section, we conduct experiments to reveal how effectively our model learns class ties with the three variant loss functions mentioned above, and the impact of class ties on relation extraction. As mentioned above, we adopt two techniques to model class ties: multi-label learning with ranking based loss functions, and a regularization term. Below, we conduct experiments on each of these two aspects. We will adopt P/R curves and precisions@N (, , , ) to show the model performances.
Ranking based Loss Function. The effectiveness of the ranking loss functions in learning class ties lies in the joint extraction of relations through multi-label learning, so to reveal the impact of the ranking loss functions on learning class ties, we compare joint extraction with separated extraction. The regularization term is added to all variant models. To conduct the separated extraction experiment, we divide the labels of each entity tuple into single labels, and for each relation label we select the sentences expressing this relation to construct the bag; we then use the reconstructed dataset to train our model with our three variant loss functions.
P@N(%)  100  200  300  400  500  Ave. 
R.+AVE+J.  
R.+AVE+S.  80.2  74.9  72.2  67.8  64.0  71.8 
R.+ATT+J.  86.8  78.4  75.2  71.1  78.4  
R.+ATT+S.  82.7  75.3  
R.+ExATT+J.  86.8  83.2  81.1  76.7  73.5  80.3 
R.+ExATT+S.  76.3 
Experimental results are shown in Fig. 4 and Table 3. From the results we can see that: (1) for Rank+ATT and Rank+Cost, joint extraction performs better than separated extraction, which demonstrates that class ties improve relation extraction and that the two methods learn class ties effectively; (2) for Rank+AVE, surprisingly, joint extraction does not keep up with separated extraction. The second phenomenon may come from the way the AVE method aggregates sentences. For joint extraction, we combine all the sentences containing the same entity tuple; however, not all of these sentences express the same relation: some express one relation type and others express another. Simply averaging the sentence representations hinders the model from learning the latent mapping from the sentences to the corresponding relation type, because the averaging operation introduces redundant information from unrelated sentences.
P@N(%)  100  200  300  400  500  Ave. 
R.+AVE+noregu.  66.5  64.0  
R.+AVE+regu.  79.1  73.8  70.4  70.5  
R.+ATT+noregu.  
R.+ATT+regu.  86.8  80.6  78.4  75.2  71.1  78.4 
R.+Cost+noregu.  
R.+Cost+regu.  86.8  83.2  81.1  76.7  73.5  80.3 
Regularization. To see the impact of the regularization technique on modeling class ties, we compare the methods with regularization against the same methods without it. All variant models use joint extraction. The results are shown in Fig. 5 and Table 4. Regularizing the learning of relations further improves the performance of Rank+Cost and Rank+ATT, which demonstrates the effectiveness of regularization for modeling class ties. Regularization has little effect on Rank+AVE; the noise introduced by averaging sentence embeddings may hinder its positive effect.
P@N(%)  100  200  300  400  500  Ave. 
ATT  
ATT+NR  
ATT+P  
ATT+NR+P 
4.5 Impact of Cost-sensitive Learning
In this section, we conduct experiments to examine how effectively cost-sensitive learning relieves the impact of NR on model training and performance. The loss function of Variant-3 has two cost-sensitive parts, each controlled by its own penalty coefficient; the second of these is the NR cost. Based on this loss function, we consider the first cost and the NR cost separately and together to see the impact of cost-sensitive learning. We report P/R curves and precision@N (P@N) to show model performance.
The results are shown in Fig. 6 and Table 5. Considering the first cost slightly improves performance in the low-recall range, while considering the NR cost boosts performance significantly. Considering both costs achieves the best performance. These results show that relieving the impact of NR is essential for improving extraction performance.
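To make the idea of down-weighting the NR cost concrete, the following is a generic sketch of a cost-weighted pairwise hinge ranking loss. It is not the Variant-3 loss of this paper: the function `weighted_ranking_loss`, the `margin`, and the `nr_weight` coefficient are hypothetical names and values chosen only for illustration.

```python
from typing import Dict, Set


def weighted_ranking_loss(scores: Dict[str, float], gold: Set[str],
                          margin: float = 1.0, nr_label: str = "NR",
                          nr_weight: float = 0.1) -> float:
    """Pairwise hinge ranking loss; bags labelled only NR are down-weighted."""
    loss = 0.0
    for pos in gold:
        for neg, neg_score in scores.items():
            if neg not in gold:
                # Penalize a negative class scored within the margin of a gold class.
                loss += max(0.0, margin - scores[pos] + neg_score)
    # Assumption: the NR penalty rescales the whole loss of an NR-only bag.
    weight = nr_weight if gold == {nr_label} else 1.0
    return weight * loss


# Toy class scores for one bag (hypothetical values).
scores = {"NR": 0.2, "place_of_birth": 0.9, "place_lived": 0.6}
relation_bag_loss = weighted_ranking_loss(scores, {"place_of_birth", "place_lived"})
nr_bag_loss = weighted_ranking_loss(scores, {"NR"})
print(relation_bag_loss, nr_bag_loss)
```

With a small `nr_weight`, the many NR bags in a distantly supervised dataset contribute less to the total loss, which is the intuition behind rescaling the NR cost.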
4.6 Impact of NR
The discussion above shows that NR has a significant impact on model performance, so in this section we conduct further experiments on how the NR cost and its penalty coefficient affect model performance.
Effect of Penalty. We conduct experiments on the choice of the NR penalty coefficient. Based on the loss function of Variant-3, we vary this coefficient to see how strongly NR affects performance. We again report P/R curves and precision@N (P@N). Models use joint extraction and regularization. The results are shown in Fig. 7 and Table 6. As the coefficient becomes larger, model performance decreases because NR has a more negative impact; the coefficient should therefore be set to a small value to achieve better performance.
P@N(%)  100  200  300  400  500  Ave. 
=  
=  
=  
= 
Effect of NR for Model Convergence. We further evaluate the impact of NR on the convergence behavior of our model during training. For each of the three variant loss functions, we record the maximal F-measure in each iteration to represent model performance at the current epoch. Models use joint extraction but no regularization. Model parameters are tuned over multiple epochs and the convergence curves are shown in Fig. 8. We find that “+NR” converges more quickly than “-NR”, reaching its final score after relatively few epochs. In general, “-NR” converges more smoothly and achieves better performance than “+NR” in the end.
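The per-epoch score used for the convergence curves can be sketched as follows, assuming that the F-measure is the balanced F1 score and that it is maximized over the points of the P/R curve. The function `max_f_measure` and the toy curve are illustrative assumptions.

```python
from typing import Iterable, Tuple


def max_f_measure(pr_points: Iterable[Tuple[float, float]]) -> float:
    """Largest F1 score over the (precision, recall) points of a P/R curve."""
    best = 0.0
    for precision, recall in pr_points:
        if precision + recall > 0:
            f1 = 2 * precision * recall / (precision + recall)
            best = max(best, f1)
    return best


# Toy P/R curve for a single epoch (hypothetical values).
curve = [(0.9, 0.1), (0.7, 0.3), (0.5, 0.4), (0.3, 0.5)]
epoch_score = max_f_measure(curve)
print(round(epoch_score, 4))
```

Recording this value once per epoch gives one point of a convergence curve.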
5 Conclusion and Future Work
In this work, we propose ranking based cost-sensitive multi-label learning for distant supervision relation extraction, aiming to leverage class ties to enhance relation extraction and to relieve the class imbalance problem. To exploit class ties between relations, we propose a general ranking based multi-label learning framework combined with convolutional neural networks, in which ranking based loss functions with a regularization technique are introduced to learn the latent connections between relations. Furthermore, to deal with class imbalance in distant supervision relation extraction, we adopt cost-sensitive learning to rescale the costs from the positive and negative labels. In the experimental study, we analyze the effectiveness of our cost-sensitive ranking loss functions and of the regularization technique.
In the future, we will focus on the following aspects: (1) Our method considers pairwise relations between labels; to better exploit class ties, we will extend it to exploit the influence of all other labels on each relation, moving from second-order to high-order strategies (48); (2) We will regard distant supervision relation extraction as a multi-instance learning-to-rank problem, design algorithms from the learning-to-rank perspective, and incorporate other advanced techniques from the information retrieval field; (3) What effect do entity pairs have on relation extraction performance? Can a generic entity pair placeholder represent all entity pairs? Answering these two questions may help the transfer learning of RE systems.
Acknowledgment
This work was supported by the National High-tech Research and Development Program (863 Program) (No. 2014AA015105) and National Natural Science Foundation of China (No. 61602490).
References
 (1) M. Mintz, S. Bills, R. Snow, D. Jurafsky, Distant supervision for relation extraction without labeled data, in: Proceedings of ACL-IJCNLP, 2009.
 (2) K. D. Bollacker, C. Evans, P. Paritosh, T. Sturge, J. Taylor, Freebase: a collaboratively created graph database for structuring human knowledge, in: Proceedings of KDD, 2008, pp. 1247–1250.
 (3) R. Hoffmann, C. Zhang, X. Ling, L. Zettlemoyer, D. S. Weld, Knowledge-based weak supervision for information extraction of overlapping relations, in: Proceedings of ACL-HLT, 2011.
 (4) M. Surdeanu, J. Tibshirani, R. Nallapati, C. D. Manning, Multi-instance multi-label learning for relation extraction, in: Proceedings of EMNLP, 2012.
 (5) J. Fürnkranz, E. Hüllermeier, E. L. Mencía, K. Brinker, Multilabel classification via calibrated label ranking, Machine Learning 73 (2) (2008) 133–153.
 (6) M.-L. Zhang, Z.-H. Zhou, Multilabel neural networks with applications to functional genomics and text categorization, IEEE Transactions on Knowledge and Data Engineering 18 (10) (2006) 1338–1351.
 (7) Z.-H. Zhou, M.-L. Zhang, S.-J. Huang, Y.-F. Li, Multi-instance multi-label learning, Artificial Intelligence 176 (1) (2012) 2291–2320.
 (8) T. Evgeniou, C. A. Micchelli, M. Pontil, Learning multiple tasks with kernel methods, Journal of Machine Learning Research 6 (Apr) (2005) 615–637.
 (9) N. Japkowicz, S. Stephen, The class imbalance problem: A systematic study, Intelligent Data Analysis 6 (5) (2002) 429–449.
 (10) H. Zheng, Z. Li, S. Wang, Z. Yan, J. Zhou, Aggregating inter-sentence information to enhance relation extraction, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016.
 (11) Y. Lin, S. Shen, Z. Liu, H. Luan, M. Sun, Neural relation extraction with selective attention over instances, in: Proceedings of ACL, 2016.
 (12) Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.
 (13) C. N. d. Santos, B. Xiang, B. Zhou, Classifying relations by ranking with convolutional neural networks, in: Proceedings of ACL, 2015.
 (14) S. Riedel, L. Yao, A. McCallum, Modeling relations and their mentions without labeled text, in: Proceedings of ECML-PKDD, Springer, 2010, pp. 148–163.
 (15) X. Han, L. Sun, Global distant supervision for relation extraction, in: Proceedings of AAAI, 2016.
 (16) D. Zeng, K. Liu, S. Lai, G. Zhou, J. Zhao, et al., Relation classification via convolutional deep neural network, in: Proceedings of COLING, 2014.
 (17) M. Yu, M. Gormley, M. Dredze, Factor-based compositional embedding models, in: NIPS Workshop on Learning Semantics, 2014.
 (18) H. Ye, Z. Yan, Z. Luo, W. Chao, Dependency-tree based convolutional neural networks for aspect term extraction, in: Advances in Knowledge Discovery and Data Mining - 21st Pacific-Asia Conference, PAKDD 2017, Jeju, South Korea, May 23-26, 2017, Proceedings, Part II, 2017.
 (19) H. Ye, L. Wang, Semi-supervised learning for neural keyphrase generation, in: Proceedings of Empirical Methods in Natural Language Processing, 2018.
 (20) H. Ye, X. Jiang, Z. Luo, W. Chao, Interpretable charge predictions for criminal cases: Learning to generate court views from fact descriptions, CoRR abs/1802.08504.
 (21) X. Jiang, H. Ye, Z. Luo, W. Chao, W. Ma, Interpretable rationale augmented charge prediction system, in: The 27th International Conference on Computational Linguistics: System Demonstrations, 2018.
 (22) D. Zeng, K. Liu, Y. Chen, J. Zhao, Distant supervision for relation extraction via piecewise convolutional neural networks, in: Proceedings of EMNLP, 2015.
 (23) T. Luong, H. Pham, C. D. Manning, Effective approaches to attention-based neural machine translation, in: Proceedings of EMNLP, 2015.
 (24) D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473.
 (25) Y. Lin, Z. Liu, M. Sun, Neural relation extraction with multilingual attention, in: Proceedings of Association for Computational Linguistics, 2017.
 (26) W. Zeng, Y. Lin, Z. Liu, M. Sun, Incorporating relation paths in neural relation extraction, arXiv preprint arXiv:1609.07479.
 (27) G. Ji, K. Liu, S. He, J. Zhao, Distant supervision for relation extraction with sentence-level attention and entity descriptions, in: AAAI, 2017, pp. 3060–3066.
 (28) L. Chen, Y. Feng, S. Huang, B. Luo, D. Zhao, Encoding implicit relation requirements for relation extraction: A joint inference approach, Artificial Intelligence 265 (2018) 45–66.
 (29) H. Ye, W. Li, L. Wang, Jointly learning semantic parser and natural language generator via dual information maximization, CoRR abs/1906.00575.
 (30) B. Luo, Y. Feng, Z. Wang, Z. Zhu, S. Huang, R. Yan, D. Zhao, Learning with noise: Enhance distantly supervised relation extraction with dynamic transition matrix, in: Proceedings of Association for Computational Linguistics, 2017.
 (31) P. Qin, W. Xu, W. Y. Wang, DSGAN: Generative adversarial training for distant supervision relation extraction, arXiv preprint arXiv:1805.09929.
 (32) X. Han, Z. Liu, M. Sun, Denoising distant supervision for relation extraction via instance-level adversarial training, arXiv preprint arXiv:1805.10959.
 (33) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems, 2014.
 (34) J. Feng, M. Huang, L. Zhao, Y. Yang, X. Zhu, Reinforcement learning for relation classification from noisy data, in: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018.
 (35) P. Qin, W. Xu, W. Y. Wang, Robust distant supervision relation extraction via deep reinforcement learning, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, 2018.
 (36) T. Liu, K. Wang, B. Chang, Z. Sui, A soft-label method for noise-tolerant distantly supervised relation extraction, in: Proceedings of Empirical Methods in Natural Language Processing, 2017.
 (37) T. Liu, X. Zhang, W. Zhou, W. Jia, Neural relation extraction via inner-sentence noise reduction and transfer learning, in: Proceedings of Empirical Methods in Natural Language Processing, 2018.
 (38) T.-Y. Liu, Learning to rank for information retrieval, Foundations and Trends in Information Retrieval 3 (3) (2009) 225–331.
 (39) F. Zhao, Y. Huang, L. Wang, T. Tan, Deep semantic ranking based hashing for multi-label image retrieval, in: Proceedings of CVPR, 2015.
 (40) A. Severyn, A. Moschitti, Learning to rank short text pairs with convolutional deep neural networks, in: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2015, pp. 373–382.
 (41) W. Shen, X. Wang, Y. Wang, X. Bai, Z. Zhang, Deepcontour: A deep convolutional feature learned by positive-sharing loss for contour detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
 (42) S. H. Khan, M. Bennamoun, F. Sohel, R. Togneri, Cost sensitive learning of deep feature representations from imbalanced data, arXiv preprint arXiv:1508.03422.
 (43) C. Huang, Y. Li, C. Change Loy, X. Tang, Learning deep representation for imbalanced classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
 (44) H. He, E. A. Garcia, Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. 21 (9) (2009) 1263–1284.
 (45) H. Ye, W. Chao, Z. Luo, Z. Li, Jointly extracting relations with class ties via effective deep ranking, in: Proceedings of Association for Computational Linguistics, 2017.
 (46) L. Wang, Z. Cao, G. de Melo, Z. Liu, Relation classification via multi-level attention CNNs, in: Proceedings of ACL, Volume 1: Long Papers, 2016.
 (47) J. Weston, S. Bengio, N. Usunier, WSABIE: scaling up to large vocabulary image annotation, in: Proceedings of IJCAI, 2011.
 (48) M.-L. Zhang, Z.-H. Zhou, A review on multi-label learning algorithms, IEEE Transactions on Knowledge and Data Engineering 26 (8) (2014) 1819–1837.