Learning with Noise: Enhance Distantly Supervised Relation Extraction with Dynamic Transition Matrix


Distant supervision significantly reduces human efforts in building training data for many classification tasks. While promising, this technique often introduces noise to the generated training data, which can severely affect the model performance. In this paper, we take a deep look at the application of distant supervision in relation extraction. We show that the dynamic transition matrix can effectively characterize the noise in the training data built by distant supervision. The transition matrix can be effectively trained using a novel curriculum learning based method without any direct supervision about the noise. We thoroughly evaluate our approach under a wide range of extraction scenarios. Experimental results show that our approach consistently improves the extraction results and outperforms the state-of-the-art in various evaluation scenarios.



Figure 1: Overview of our approach

1 Our approach

To deal with the noisy training data obtained through distant supervision (DS), our approach follows four steps, as depicted in Figure 1. First, each input sentence is fed to a sentence encoder to generate an embedding vector. Our model then takes the sentence embeddings as input and produces a predicted relation distribution, $p$, for the input sentence (or the input sentence bag). At the same time, our model dynamically produces a transition matrix, $T$, which characterizes the noise pattern of the sentence (or the bag). Finally, the predicted distribution is multiplied by the transition matrix to produce the observed relation distribution, $o$, which is used to match the noisy relation labels assigned by DS, while the predicted relation distribution serves as the output of our model during testing. One of the key challenges of our approach is determining the element values of the transition matrix, which is described in the section on training.

1.1 Sentence-level Modeling

Sentence Embedding and Prediction

In this work, we use a piecewise convolutional neural network [Zeng et al., 2015] for sentence encoding, but other sentence embedding models can also be used. We feed the sentence embedding to a fully connected layer and use softmax to generate the predicted relation distribution, $p$.

Noise Modeling

First, each sentence embedding $x$, generated by the sentence encoder, is passed to a fully connected layer acting as a non-linearity to obtain the sentence embedding $x_n$ used specifically for noise modeling. We then use softmax to calculate the transition matrix $T$ for each sentence:

\[ T_{ij} = \frac{\exp(w_{ij}^{\top} x_n + b)}{\sum_{j'=1}^{|C|} \exp(w_{ij'}^{\top} x_n + b)} \qquad (1) \]

where $T_{ij}$ is the conditional probability for the input sentence to be labeled as relation $j$ by DS, given $i$ as the true relation, $b$ is a scalar bias, $|C|$ is the number of relations, and $w_{ij}$ is the weight vector characterizing the confusion between $i$ and $j$.

Here, we dynamically produce a transition matrix, $T$, specifically for each sentence, but with the parameters ($w_{ij}$) shared across the dataset. By doing so, we can adaptively characterize the noise pattern for each sentence with only a few parameters. In contrast, one could also produce a single global transition matrix for all sentences with much less computation, since $T$ then need not be computed on the fly (see the section on TimeRE results).

Observed Distribution

When we characterize the noise in a sentence with a transition matrix $T$, if its true relation is $i$, we can assume that $i$ might be erroneously labeled as relation $j$ by DS with probability $T_{ij}$. We can therefore capture the observed relation distribution, $o$, by multiplying $T$ and the predicted relation distribution, $p$:

\[ o = T^{\top} p \qquad (2) \]

where $o$ is then normalized to ensure $\sum_i o_i = 1$.

Rather than using the predicted distribution $p$ to directly match the relation labeled by DS [Zeng et al., 2015; Lin et al., 2016], here we use $o$ to match the noisy labels during training and still use $p$ as output during testing. This captures the procedure by which the noisy label is produced and thus protects $p$ from the noise.
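The sentence-level noise model described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names, the nested-list layout of the weights `W` (one weight vector per relation pair), and the toy inputs are all assumptions made here for clarity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def transition_matrix(x_n, W, b):
    """Build a per-sentence transition matrix.

    x_n : noise-modeling sentence embedding (list of floats)
    W   : W[i][j] is the weight vector for the relation pair (i, j)
          (hypothetical layout, shared across the dataset)
    b   : scalar bias
    Row i is a softmax over j, so each row sums to 1.
    """
    return [
        softmax([sum(w * x for w, x in zip(w_ij, x_n)) + b for w_ij in W_i])
        for W_i in W
    ]


def observed_distribution(T, p):
    """Observed distribution: transpose(T) times p, renormalized to sum to 1."""
    n = len(p)
    o = [sum(T[i][j] * p[i] for i in range(n)) for j in range(n)]
    total = sum(o)
    return [v / total for v in o]


# Toy usage with 3 relations and a 2-d embedding (all values are made up).
W = [[[0.0, 0.0] for _ in range(3)] for _ in range(3)]
T = transition_matrix([1.0, 2.0], W, 0.1)
o = observed_distribution(T, [0.7, 0.2, 0.1])
```

With an identity transition matrix the observed distribution equals the predicted one, which corresponds to assuming the labels carry no noise.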

1.2 Bag Level Modeling

Bag Embedding and Prediction

One of the key challenges for the bag level model is how to aggregate the embeddings of individual sentences into the bag level. In this work, we experiment with two methods, namely average and attention aggregation [Lin et al., 2016]. The former calculates the bag embedding, $s$, by averaging the embeddings of each sentence, and then feeds it to a softmax classifier for relation classification.

The attention aggregation calculates an attention value, $a_{ij}$, for each sentence $i$ in the bag with respect to each relation $j$, and aggregates to the bag level as $s_j$, by the following equations (see footnote 1):

\[ s_j = \sum_{i=1}^{n} a_{ij} x_i; \qquad a_{ij} = \frac{\exp(x_i^{\top} r_j)}{\sum_{i'=1}^{n} \exp(x_{i'}^{\top} r_j)} \qquad (3) \]

where $x_i$ is the embedding of sentence $i$, $n$ is the number of sentences in the bag, and $r_j$ is the randomly initialized embedding for relation $j$. In a similar spirit to [Lin et al., 2016], the resulting bag embedding $s_j$ is fed to a softmax classifier to predict the probability of relation $j$ for the given bag.
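As a rough illustration of the two aggregation schemes, the sketch below computes an average bag embedding and one attention-weighted bag embedding per relation using dot-product attention. The function names and toy vectors are assumptions introduced here, and the softmax classifier that consumes the bag embeddings is omitted.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def average_bag_embedding(sentences):
    """Average aggregation: mean of the sentence embeddings."""
    n = len(sentences)
    dim = len(sentences[0])
    return [sum(x[k] for x in sentences) / n for k in range(dim)]


def attention_bag_embeddings(sentences, relations):
    """Attention aggregation: one bag embedding per relation.

    For relation embedding r, the attention over sentences is a softmax
    of the dot products x_i . r, and the bag embedding is the
    attention-weighted sum of the sentence embeddings.
    """
    dim = len(sentences[0])
    bags = []
    for r in relations:
        scores = [dot(x, r) for x in sentences]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        attn = [e / z for e in exps]
        bags.append(
            [sum(a * x[k] for a, x in zip(attn, sentences)) for k in range(dim)]
        )
    return bags


# Toy bag of two sentences and two relation embeddings (made-up values).
sentences = [[1.0, 0.0], [0.0, 1.0]]
relations = [[0.0, 0.0], [5.0, 0.0]]
bags = attention_bag_embeddings(sentences, relations)
```

A zero relation embedding gives uniform attention, so the first bag embedding coincides with the average; the second relation embedding puts most of the weight on the first sentence.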

Noise Modeling

Since the transition matrix addresses the transition probability with respect to each true relation, the attention mechanism appears to be a natural fit for calculating the transition matrix at the bag level. Similar to the attention aggregation above, we calculate the bag embedding with respect to each relation using Equation 3, but with a separate set of relation embeddings $r'_j$. We then calculate the transition matrix, $T$, by:

\[ T_{ij} = \frac{\exp(s_i^{\top} r'_j)}{\sum_{j'=1}^{|C|} \exp(s_i^{\top} r'_{j'})} \qquad (4) \]

where $s_i$ is the bag embedding regarding relation $i$, and $r'_j$ is the embedding for relation $j$.
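The bag-level transition matrix can be sketched in the same style. The snippet assumes the per-relation bag embeddings have already been computed with the separate set of relation embeddings; the function name `bag_transition_matrix` and the argument names are hypothetical, and no bias term is modeled.

```python
import math


def bag_transition_matrix(bag_embeddings, noise_relation_embeddings):
    """Row i is a softmax over j of the dot product between the bag
    embedding for relation i and the noise-specific embedding of relation j.
    """
    T = []
    for s_i in bag_embeddings:
        scores = [
            sum(a * b for a, b in zip(s_i, r_j))
            for r_j in noise_relation_embeddings
        ]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(v - m) for v in scores]
        z = sum(exps)
        T.append([e / z for e in exps])
    return T


# Toy example with 2 relations and 2-d embeddings (made-up values).
bag_embeddings = [[1.0, 0.0], [0.0, 1.0]]
noise_relation_embeddings = [[2.0, 0.0], [0.0, 2.0]]
T_bag = bag_transition_matrix(bag_embeddings, noise_relation_embeddings)
```

In this toy setup each bag embedding aligns with its own relation embedding, so the diagonal entries dominate, which is what a low-noise bag would look like.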

2 Evaluation Methodology

Our experiments aim to answer two main questions: (1) is it possible to model the noise in the training data generated through DS, even when there is no prior knowledge to guide us? and (2) can prior knowledge of data quality help our approach better handle the noise?

We apply our approach to both sentence level and bag level extraction models, and evaluate in the situations where we do not have prior knowledge of the data quality as well as where such prior knowledge is available.

2.1 Datasets

We evaluate our approach on two datasets.


We build TimeRE by using DS to align time-related Wikidata [Vrandečić and Krötzsch, 2014] KB triples to Wikipedia text. It contains 278,141 sentences with 12 types of relations between an entity mention and a time expression. We choose to use time-related relations because time expressions speak for themselves in terms of reliability. That is, given a KB triple $\langle e, \mathit{rel}, t \rangle$ and its aligned sentences, the finer-grained the time expression that appears in the sentence, the more likely the sentence supports the existence of this triple. For example, a sentence containing both Alphabet and October-2-2015 is very likely to express the inception-time of Alphabet, while a sentence containing both Alphabet and 2015 could instead talk about many events, e.g., releasing the financial report of 2015, hiring a new CEO, etc. Using this heuristic, we can split the dataset into 3 subsets according to the granularity of the time expressions involved, indicating different levels of reliability. Our criteria for determining the reliability are as follows. Instances with full date expressions, i.e., Year-Month-Day, can be seen as the most reliable data, while those with partial date expressions, e.g., Month-Year and Year-Only, are considered less reliable. Negative data are constructed heuristically: any entity-time pair in a sentence without a corresponding triple in Wikidata is treated as negative data. During training, we can access 184,579 negative and 77,777 positive sentences, the latter including 22,214 reliable ones and 2,094 and 53,469 less reliable ones. The validation set and test set are randomly sampled from the reliable (full-date) data for a relatively fair evaluation and contain 2,776 and 2,771 positive sentences and 5,143 and 5,095 negative sentences, respectively.


EntityRE is a widely used entity relation extraction dataset, built by aligning triples in Freebase to the New York Times (NYT) corpus [Riedel et al., 2010]. It contains 52 relations, 136,947 positive and 385,664 negative sentences for training, and 6,444 positive and 166,004 negative sentences for testing. Unlike TimeRE, this dataset does not contain any prior knowledge about the data quality. Since the sentence level annotations in EntityRE are too noisy to serve as a gold standard, we only evaluate bag-level models on EntityRE, a standard practice in previous work [Surdeanu et al., 2012; Zeng et al., 2015; Lin et al., 2016].

2.2 Experimental Setup


We use 200 convolution kernels with window size 3. During training, we use stochastic gradient descent (SGD) with batch size 20. The learning rates for sentence-level and bag-level models are 0.1 and 0.01, respectively.

Sentence level experiments are performed on TimeRE, using 100-d word embeddings pre-trained using GloVe [Pennington et al., 2014] on Wikipedia and Gigaword [Parker et al., 2011], and 20-d vectors for distance embeddings. Each of the three subsets of TimeRE is added after the previous phase has run for 15 epochs. The trace regularization weights are , and , respectively, from the reliable to the most unreliable, with the ratio of and fixed to 10 or 5 when tuning.
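The phased schedule, where each subset joins the training data after the previous phase has run for 15 epochs, could be expressed as a small helper like the one below. This is a sketch under assumptions: the subset names, their ordering from most to least reliable, and the zero-based epoch counter are illustrative choices, not details taken from the paper.

```python
EPOCHS_PER_PHASE = 15  # from the setup: a new subset is added every 15 epochs

# Hypothetical subset names, ordered from most to least reliable.
SUBSETS = ["reliable", "less_reliable", "least_reliable"]


def active_subsets(epoch, subsets=SUBSETS, epochs_per_phase=EPOCHS_PER_PHASE):
    """Return the subsets used for training at a zero-based epoch index.

    Phase 0 uses only the first subset; each later phase adds the next
    one until all subsets are in use.
    """
    phase = min(epoch // epochs_per_phase, len(subsets) - 1)
    return subsets[: phase + 1]


first_phase = active_subsets(0)
second_phase = active_subsets(15)
final_phase = active_subsets(100)
```

Once all subsets have been added, the helper keeps returning the full list, so training can continue for any number of epochs in the last phase.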

Bag level experiments are performed on both TimeRE and EntityRE. For TimeRE, we use the same parameters as above. For EntityRE, we use 50-d word embeddings pre-trained on the NYT corpus using word2vec [Mikolov et al., 2013], and 5-d vectors for distance embedding. For both datasets, and in Eq. LABEL:general_loss are initialized to 1 and 0.1, respectively. We tried various decay rates, {0.95, 0.9, 0.8}, and steps, {3, 5, 8}. We found that using a decay rate of 0.9 with a step of 5 gives the best performance in most cases.

Evaluation Metric

The performance is reported using the precision-recall (PR) curve, which is a standard evaluation metric in relation extraction. Specifically, the extraction results are first ranked in descending order of their confidence scores; precision and recall are then calculated by setting the threshold to the score of each extraction result in turn.
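The ranking-and-threshold procedure can be sketched as follows. The input format (a list of `(score, is_correct)` pairs plus the total number of gold positives) and the function name `pr_curve` are assumptions made for this example; ties between scores are not handled specially.

```python
def pr_curve(scored_results, total_positives):
    """Compute (precision, recall) points for a ranked list of extractions.

    scored_results  : list of (confidence_score, is_correct) pairs
    total_positives : number of gold positive instances in the test set
    The threshold is moved down the ranking one result at a time.
    """
    ranked = sorted(scored_results, key=lambda item: item[0], reverse=True)
    points = []
    correct = 0
    for k, (_, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        points.append((correct / k, correct / total_positives))
    return points


# Toy example: three extractions, two of which are correct.
points = pr_curve([(0.7, True), (0.9, True), (0.8, False)], total_positives=2)
```

Each point corresponds to keeping the top-k results, so the list has one entry per extraction and recall never decreases along it.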

Naming Conventions

We evaluate our approach under a wide range of settings for sentence level (sent_) and bag level (bag_) models: (1) _mix: trained on all three subsets of TimeRE mixed together; (2) _reliable: trained using the reliable subset of TimeRE only; (3) _PR: trained with prior knowledge of annotation quality, i.e., starting from the reliable data and then adding the unreliable data; (4) _TM: trained with a dynamic transition matrix; (5) _GTM: trained with a global transition matrix. At the bag level, we also investigate the performance of average aggregation (_avg) and attention aggregation (_att).

3 Conclusions

In this paper, we investigate the noise problem inherent in DS-style training data. We argue that the data speak for themselves by providing useful clues that reveal their noise patterns. We thus propose a novel transition matrix based method to dynamically characterize the noise underlying such training data in a unified framework alongside the original prediction objective. One of our key innovations is to exploit a curriculum learning based training method to gradually learn to model the underlying noise pattern without direct guidance, and to provide the flexibility to exploit any prior knowledge of the data quality to further improve the effectiveness of the transition matrix. We evaluate our approach in two learning settings of distantly supervised relation extraction. The experimental results show that the proposed method can better characterize the underlying noise and consistently outperforms state-of-the-art extraction models under various scenarios.


This work is supported by the National High Technology R&D Program of China (2015AA015403); the National Natural Science Foundation of China (61672057, 61672058); KLSTSPI Key Lab. of Intelligent Press Media Technology; the UK Engineering and Physical Sciences Research Council under grants EP/M01567X/1 (SANDeRs) and EP/M015793/1 (DIVIDEND); and the Royal Society International Collaboration Grant (IE161012).


  1. While [Lin et al., 2016] use a bilinear function to calculate $a_{ij}$, we simply use the dot product, since we find that these two functions perform similarly in our experiments.


  1. Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In ICML. ACM, pages 41–48.
  2. Xinlei Chen and Abhinav Gupta. 2015. Webly supervised learning of convolutional networks. In ICCV. pages 1431–1439.
  3. Meng Fang and Trevor Cohn. 2016. Learning when to trust distant supervision: An application to low-resource POS tagging using cross-lingual projection. In CoNLL. pages 178–186.
  4. Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford 1(12).
  5. Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of ACL. pages 541–550.
  6. Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In ACL. volume 1, pages 2124–2133.
  7. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS. pages 3111–3119.
  8. Bonan Min, Ralph Grishman, Li Wan, Chang Wang, and David Gondek. 2013. Distant supervision for relation extraction with an incomplete knowledge base. In HLT-NAACL. pages 777–782.
  9. Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In ACL. pages 1003–1011.
  10. Ishan Misra, C Lawrence Zitnick, Margaret Mitchell, and Ross Girshick. 2016. Seeing through the human reporting bias: Visual classifiers from noisy human-centric labels. In CVPR. pages 2930–2939.
  11. Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2011. English gigaword fifth edition, linguistic data consortium. Technical report, Linguistic Data Consortium, Philadelphia.
  12. Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In EMNLP. volume 14, pages 1532–1543.
  13. Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. 2014. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596.
  14. Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, pages 148–163.
  15. Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named entity recognition in tweets: an experimental study. In EMNLP. Association for Computational Linguistics, pages 1524–1534.
  16. Alan Ritter, Luke Zettlemoyer, Mausam, and Oren Etzioni. 2013. Modeling missing data in distant supervision for information extraction. TACL 1:367–378.
  17. Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. 2015. Training convolutional networks with noisy labels. In ICLR.
  18. Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D Manning. 2012. Multi-instance multi-label learning for relation extraction. In EMNLP-CoNLL. pages 455–465.
  19. Shingo Takamatsu, Issei Sato, and Hiroshi Nakagawa. 2012. Reducing wrong labels in distant supervision for relation extraction. In ACL. pages 721–729.
  20. Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Communications of the ACM 57(10):78–85.
  21. Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. 2015. Learning from massive noisy labeled data for image classification. In CVPR. pages 2691–2699.
  22. Wei Xu, Raphael Hoffmann, Le Zhao, and Ralph Grishman. 2013. Filling knowledge base gaps for distant supervision of relation extraction. In ACL. pages 665–670.
  23. Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In EMNLP. pages 1753–1762.