Task-Oriented Learning of Word Embeddings for Semantic Relation Classification
We present a novel learning method for word embeddings designed for relation classification. Our word embeddings are trained by predicting words between noun pairs using lexical relation-specific features on a large unlabeled corpus. This allows us to explicitly incorporate relation-specific information into the word embeddings. The learned word embeddings are then used to construct feature vectors for a relation classification model. On a well-established semantic relation classification task, our method significantly outperforms a baseline based on a previously introduced word embedding method, and compares favorably to previous state-of-the-art models that use syntactic information or manually constructed external resources.
Automatic classification of semantic relations has a variety of applications, such as information extraction and the construction of semantic networks. A traditional approach to relation classification is to train classifiers using various kinds of features with class labels annotated by humans. Carefully crafted features derived from lexical, syntactic, and semantic resources play a significant role in achieving high accuracy for semantic relation classification.
In recent years there has been an increasing interest in using word embeddings as an alternative to traditional hand-crafted features. Word embeddings are represented as real-valued vectors and capture syntactic and semantic similarity between words. For example, word2vec
In this work we present a learning method for word embeddings specifically designed to be useful for relation classification. The overview of our system and the embedding learning process are shown in Figure 1. First we train word embeddings by predicting each of the words between noun pairs using lexical relation-specific features on a large unlabeled corpus. We then use the word embeddings to construct lexical feature vectors for relation classification. Lastly, the feature vectors are used to train a relation classification model.
We evaluate our method on a well-established semantic relation classification task and compare it to a baseline based on word2vec embeddings and previous state-of-the-art models that rely on either manually crafted features, syntactic parses or external semantic resources. Our method significantly outperforms the word2vec-based baseline, and compares favorably with previous state-of-the-art models, despite relying only on lexical level features and no external annotated resources. Furthermore, our qualitative analysis of the learned embeddings shows that n-grams of our embeddings capture salient syntactic patterns similar to semantic relation types.
A traditional approach to relation classification is to train classifiers in a supervised fashion using a variety of features. These features include lexical bag-of-words features and features based on syntactic parse trees. For syntactic parse trees, the paths between the target entities on constituency and dependency trees have been demonstrated to be useful. On the shared task introduced by Hendrickx et al., Rink and Harabagiu achieved the best score using a variety of hand-crafted features which were then used to train a Support Vector Machine (SVM).
Recently, word embeddings have become popular as an alternative to hand-crafted features. However, one of the limitations is that word embeddings are usually learned by predicting a target word in its context, leading to only local co-occurrence information being captured. Thus, several recent studies have focused on overcoming this limitation. Le and Mikolov integrated paragraph information into a word2vec-based model, which allowed them to capture paragraph-level information. For dependency parsing, Bansal et al. and Chen et al. found ways to improve performance by integrating dependency-based context information into their embeddings. Bansal et al. trained embeddings by defining parent and child nodes in dependency trees as contexts. Chen et al. introduced the concept of feature embeddings induced by parsing a large unannotated corpus and then learning embeddings for the manually crafted features. For information extraction, Boros et al. trained word embeddings relevant for event role extraction, and Nguyen and Grishman employed word embeddings for domain adaptation of relation extraction. Another kind of task-specific word embeddings was proposed by Tang et al., who used sentiment labels on tweets to adapt word embeddings for a sentiment analysis task. However, such an approach is only feasible when a large amount of labeled data is available.
3 Relation Classification Using Word Embedding-based Features
We propose a novel method for learning word embeddings designed for relation classification. The word embeddings are trained by predicting each word between noun pairs, given the corresponding low-level features for relation classification. In general, to classify relations between pairs of nouns the most important features come from the pairs themselves and the words between and around the pairs. For example, in the sentence in Figure 1 (b) there is a cause-effect relationship between the two nouns conflicts and players. To classify the relation, the most common features are the noun pair (conflicts, players), the words between the noun pair (are, caused, by), the words before the pair (the, external), and the words after the pair (playing, tiles, to, ...). As shown in previous work, the words between the noun pairs are the most effective among these features. Our main idea is to treat the most important features (the words between the noun pairs) as the targets to be predicted and other lexical features (noun pairs, words outside them) as their contexts. Due to this, we expect our embeddings to capture relevant features for relation classification better than previous models which only use window-based contexts.
In this section we first describe the learning process for the word embeddings, focusing on lexical features for relation classification (Figure 1 (b)). We then propose a simple and powerful technique to construct features which serve as input for a softmax classifier. The overview of our proposed system is shown in Figure 1 (a).
3.1 Learning Word Embeddings
Assume that there is a noun pair in a sentence with words between the pair and words before and after the pair:
Our method predicts each target word between the pair using three kinds of information: the noun pair itself, the words around the target word between the pair, and the words before and after the pair. Words are embedded in a vector space of fixed dimensionality and we refer to these vectors as word embeddings. To distinguish the words of the noun pair from the words between, before, and after it, we have two sets of word embeddings: a noun-specific set, used for the noun pair, and a standard set, used for all other words. The standard set covers a vocabulary of words, and the noun-specific set also covers a vocabulary of words but contains only nouns. Hence, the word cause has two embeddings: one in the standard set and another in the noun-specific set. In general cause is used as a noun and a verb, and thus we expect the noun embeddings to capture the meanings focusing on their noun usage. This is inspired by some recent work on word representations that explicitly assigns an independent representation for each word usage according to its part-of-speech tag.
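A feature vector is constructed to predict each target word by concatenating word embeddings (Eq. (1)): the embeddings of the noun pair, the embeddings of the words within the context window on each side of the target word, and the averaged embeddings of the words before and after the pair. The context size fixes how many words are taken on each side of the target word. A special NULL token is used whenever a position in the context window falls outside the span of words between the pair.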
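The construction of this feature vector can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `<NULL>` token string, the ordering of the concatenated parts, and the toy embedding tables are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
NULL = "<NULL>"  # hypothetical name for the special padding token


def average(vectors: Sequence[Vector], dim: int) -> Vector:
    """Element-wise mean; a zero vector when there is nothing to average."""
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def context_feature(
    nouns: Tuple[str, str],
    between: Sequence[str],
    before: Sequence[str],
    after: Sequence[str],
    i: int,
    c: int,
    noun_emb: Dict[str, Vector],
    word_emb: Dict[str, Vector],
    dim: int,
) -> Vector:
    """Concatenate the embeddings used to predict the target word between[i]."""
    feature: Vector = []
    # 1) Noun-specific embeddings of the noun pair.
    for noun in nouns:
        feature += noun_emb[noun]
    # 2) Up to c words on each side of the target, padded with NULL.
    for offset in list(range(-c, 0)) + list(range(1, c + 1)):
        j = i + offset
        token = between[j] if 0 <= j < len(between) else NULL
        feature += word_emb[token]
    # 3) Averaged embeddings of the words before and after the pair.
    feature += average([word_emb[w] for w in before], dim)
    feature += average([word_emb[w] for w in after], dim)
    return feature


# Toy example built from the sentence discussed in Section 3.
dim, c = 2, 1
words = [NULL, "are", "caused", "by", "the", "external", "playing", "tiles"]
word_emb = {w: [float(k), float(k) + 0.5] for k, w in enumerate(words)}
noun_emb = {"conflicts": [1.0, 1.0], "players": [2.0, 2.0]}
feature = context_feature(
    ("conflicts", "players"), ["are", "caused", "by"],
    ["the", "external"], ["playing", "tiles"],
    i=0, c=c, noun_emb=noun_emb, word_emb=word_emb, dim=dim,
)
print(len(feature))  # 2 * dim * (2 + c) = 12
```

With embedding dimensionality `dim` and context size `c`, the vector has `2 * dim * (2 + c)` entries regardless of how many words surround the noun pair, because the outside words are averaged and the window is padded.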
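Our method then estimates a conditional probability $p(w \mid \mathbf{f})$ that the target word is a word $w$ given the feature vector $\mathbf{f}$, using a logistic regression model:
$$p(w \mid \mathbf{f}) = \sigma\left(\tilde{\mathbf{w}}_w \cdot \mathbf{f} + b_w\right),$$
where $\tilde{\mathbf{w}}_w$ is a weight vector for $w$, $b_w$ is a bias for $w$, and $\sigma(x) = 1 / (1 + e^{-x})$ is the logistic function. Each column vector in the weight matrix $\tilde{\mathbf{W}}$ corresponds to a word. That is, we assign a logistic regression model for each word, and we can train the embeddings using the one-versus-rest approach to make $p(w \mid \mathbf{f})$ for the target word $w$ larger than $p(w' \mid \mathbf{f})$ for $w' \neq w$. However, naively optimizing the parameters of those logistic regression models would lead to prohibitive computational cost since it grows linearly with the size of the vocabulary.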
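When training we employ several procedures introduced by Mikolov et al., namely, negative sampling, a modified unigram noise distribution and subsampling. For negative sampling the model parameters (the two sets of word embeddings and the weight vectors and biases of the logistic regression models) are learned by maximizing the objective function $J_{\text{unlabeled}}$:
$$J_{\text{unlabeled}} = \sum_{(w, \mathbf{f})} \left( \log p(w \mid \mathbf{f}) + \sum_{j=1}^{k} \log\left(1 - p(w'_j \mid \mathbf{f})\right) \right),$$
where the outer sum runs over every target word $w$ and its feature vector $\mathbf{f}$ in the corpus, $p(\cdot \mid \mathbf{f})$ is the conditional probability defined above, $k$ is the number of negative samples, and $w'_j$ is a word randomly drawn from the unigram noise distribution weighted by a fixed exponent. Maximizing $J_{\text{unlabeled}}$ means that our method can discriminate between each target word and $k$ noise words given the target word’s context. This approach is much less computationally expensive than the one-versus-rest approach and has proven effective in learning word embeddings.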
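The per-sample term of the negative sampling objective can be sketched as follows. The code is an illustration only: the function names and data layout are hypothetical, and the exponent of the noise distribution is left as a parameter (0.75 is the value used in the original word2vec work).

```python
import math
import random
from typing import Dict, List, Sequence

Vector = List[float]


def log_sigmoid(x: float) -> float:
    """Numerically stable log of the logistic function."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def score(word: str, feature: Vector,
          weights: Dict[str, Vector], biases: Dict[str, float]) -> float:
    """Weight vector of `word` dotted with the feature vector, plus its bias."""
    return sum(a * b for a, b in zip(weights[word], feature)) + biases[word]


def noise_distribution(counts: Dict[str, int], exponent: float) -> Dict[str, float]:
    """Unigram distribution with counts raised to `exponent`, then normalized."""
    smoothed = {w: n ** exponent for w, n in counts.items()}
    total = sum(smoothed.values())
    return {w: v / total for w, v in smoothed.items()}


def draw_negatives(noise: Dict[str, float], k: int, rng: random.Random) -> List[str]:
    """Draw k noise words according to the noise distribution."""
    words = list(noise)
    return rng.choices(words, weights=[noise[w] for w in words], k=k)


def sample_objective(feature: Vector, target: str, negatives: Sequence[str],
                     weights: Dict[str, Vector], biases: Dict[str, float]) -> float:
    """log p(target | f) + sum over negatives of log(1 - p(negative | f))."""
    value = log_sigmoid(score(target, feature, weights, biases))
    for word in negatives:
        # log(1 - sigmoid(x)) equals log_sigmoid(-x).
        value += log_sigmoid(-score(word, feature, weights, biases))
    return value
```

The cost of one training step depends on the number of negative samples rather than on the vocabulary size, which is the reason this objective is cheaper than the one-versus-rest approach.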
To reduce redundancy during training we use subsampling. A training sample, whose target word is $w$, is discarded with the probability $P_d(w) = 1 - \sqrt{t / p(w)}$, where $t$ is a threshold and $p(w)$ is a probability corresponding to the frequency of $w$ in the training corpus. The more frequent a target word is, the more likely it is to be discarded. To further emphasize infrequent words, we apply the subsampling approach not only to target words, but also to noun pairs; concretely, by drawing two random numbers $r_1$ and $r_2$, a training sample whose noun pair is $(n_1, n_2)$ is discarded if $P_d(n_1)$ is larger than $r_1$ or $P_d(n_2)$ is larger than $r_2$.
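The subsampling rule for target words and noun pairs can be sketched as below. The sketch assumes the discard probability takes the word2vec form, one minus the square root of the threshold divided by the word's relative frequency; the function names are hypothetical and the threshold is passed in rather than fixed.

```python
import math
import random
from typing import Dict, Tuple


def discard_probability(frequency: float, threshold: float) -> float:
    """word2vec-style discard probability, clipped at zero for rare words."""
    return max(0.0, 1.0 - math.sqrt(threshold / frequency))


def keep_sample(target: str, nouns: Tuple[str, str], frequency: Dict[str, float],
                threshold: float, rng: random.Random) -> bool:
    """Return False if the sample is dropped by target or noun-pair subsampling."""
    # Subsample on the target word.
    if rng.random() < discard_probability(frequency[target], threshold):
        return False
    # Subsample on the noun pair: one random number per noun, and the sample
    # is dropped if either discard probability exceeds its random number.
    r1, r2 = rng.random(), rng.random()
    if discard_probability(frequency[nouns[0]], threshold) > r1:
        return False
    if discard_probability(frequency[nouns[1]], threshold) > r2:
        return False
    return True
```

Words whose relative frequency is at or below the threshold are never discarded, so the rule only thins out samples involving frequent targets or frequent nouns.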
Since the feature vector is constructed as defined in Eq. (1), at each training step the weight vector of the target word is updated based on information about which pair of nouns surrounds the target word, which word n-grams appear in a small window around it, and which words appear outside the noun pair. Hence, the weight vector captures rich information regarding the target word.
3.2 Constructing Feature Vectors
Once the word embeddings are trained, we can use them for relation classification. Given a noun pair with its context words (the words between, before, and after the pair), we construct a feature vector to classify the relation between the two nouns by concatenating three kinds of feature vectors:
the word embeddings of the noun pair,
the averaged n-gram embeddings between the pair, and
the concatenation of the averaged word embeddings of the words before and after the pair.
The feature vector for the noun pair is the concatenation of the embeddings of the two nouns.
Words between the noun pair contribute to classifying the relation, and one of the most common ways to incorporate an arbitrary number of words is treating them as a bag of words. However, word order information is lost for bag-of-words features such as averaged word embeddings. To incorporate the word order information, we first define n-gram embeddings between the noun pair, one for each word between the pair.
Note that the weight vectors learned in Section 3.1 can also be used in these n-gram embeddings, and that the value of n is determined by the context size. As described in Section 3.1, each weight vector captures meaningful information about its word, and after the first embedding learning step we can treat the weight vectors as features for the words. Previous work has demonstrated that using such weight vectors is useful in representing the words. We then compute the feature vector by averaging the n-gram embeddings.
We use the averaging approach since the number of words between the pair depends on each instance. The feature vector allows us to represent word sequences of arbitrary lengths as fixed-length feature vectors using the simple operations: concatenation and averaging.
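The following sketch illustrates why averaged n-gram embeddings keep word order while a plain bag-of-words average does not. It is an assumption-laden simplification: each n-gram embedding here is only the concatenation of the word embeddings in a window of `c` words on each side of a centre word, the learned weight vectors are left out, and all names are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def ngram_embeddings(between: Sequence[str], c: int,
                     word_emb: Dict[str, Vector], null: str = "<NULL>") -> List[Vector]:
    """One concatenated window embedding per word between the noun pair."""
    grams: List[Vector] = []
    for i in range(len(between)):
        gram: Vector = []
        for j in range(i - c, i + c + 1):
            # Pad with the NULL token outside the span between the pair.
            token = between[j] if 0 <= j < len(between) else null
            gram += word_emb[token]
        grams.append(gram)
    return grams


def averaged_ngram_feature(between: Sequence[str], c: int,
                           word_emb: Dict[str, Vector], dim: int) -> Vector:
    """Fixed-length feature: the element-wise mean of all n-gram embeddings."""
    grams = ngram_embeddings(between, c, word_emb)
    if not grams:
        return [0.0] * ((2 * c + 1) * dim)
    return [sum(column) / len(grams) for column in zip(*grams)]


def bag_of_words_feature(between: Sequence[str],
                         word_emb: Dict[str, Vector], dim: int) -> Vector:
    """Order-insensitive baseline: the mean of the word embeddings."""
    if not between:
        return [0.0] * dim
    return [sum(column) / len(between)
            for column in zip(*(word_emb[w] for w in between))]


toy_emb = {"<NULL>": [0.0, 0.0], "caused": [1.0, 0.0], "by": [0.0, 1.0]}
forward = averaged_ngram_feature(["caused", "by"], 1, toy_emb, 2)
backward = averaged_ngram_feature(["by", "caused"], 1, toy_emb, 2)
print(forward != backward)  # the n-gram feature distinguishes the two orders
```

Both features have a fixed length whatever the number of words between the pair, but only the n-gram version changes when the same words appear in a different order.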
The words before and after the noun pair are sometimes important in classifying the relation. For example, in a phrase of the form “pour [first noun] into [second noun]”, the word pour should be helpful in classifying the relation. As with Eq. (1), we use the concatenation of the averaged word embeddings of words before and after the noun pair to compute the corresponding feature vector.
As described above, the overall feature vector is constructed by concatenating the three feature vectors. We would like to emphasize that we only use simple operations: averaging and concatenating the learned word embeddings. The feature vector is then used as input for a softmax classifier, without any complex transformation such as matrix multiplication with non-linear functions.
Given a relation classification task we train a softmax classifier using the feature vector described in Section 3.2. For each $i$-th training sample with a corresponding label $l_i$ among $L$ predefined labels, we compute a conditional probability given its feature vector $\mathbf{e}_i$:
$$p(l_i \mid \mathbf{e}_i) = \frac{\exp(o(l_i))}{\sum_{j=1}^{L} \exp(o(j))},$$
where $\mathbf{o}$ is defined as $\mathbf{o} = \mathbf{S}\mathbf{e}_i + \mathbf{s}$, and $\mathbf{S}$ and $\mathbf{s}$ are the softmax parameters. $o(j)$ is the $j$-th element of $\mathbf{o}$. We then define the objective function as:
$$J_{\text{labeled}} = \sum_{i=1}^{T} \log p(l_i \mid \mathbf{e}_i) - \frac{\lambda}{2} \lVert \theta \rVert^2 .$$
$T$ is the number of training samples and $\lambda$ controls the L2 regularization. $\theta$ is the set of parameters and $J_{\text{labeled}}$ is maximized using AdaGrad. We have found that dropout is helpful in preventing our model from overfitting. Concretely, elements in $\mathbf{e}_i$ are randomly omitted with a fixed probability at each training step. Recently dropout has been applied to deep neural network models for natural language processing tasks and proven effective.
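A single supervised training step of this kind can be sketched as follows. The sketch makes several simplifying assumptions that are not taken from the paper: only the softmax parameters are updated (the embeddings are held fixed), the L2 penalty is applied to the weight matrix but not the bias, dropout is applied without any rescaling, and all function and parameter names are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def softmax(scores: Vector) -> Vector:
    """Softmax with the maximum subtracted for numerical stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_probabilities(feature: Vector, S: Matrix, s: Vector) -> Vector:
    """p(label | feature) for every label, from one matrix-vector product."""
    scores = [sum(w * x for w, x in zip(row, feature)) + b for row, b in zip(S, s)]
    return softmax(scores)


def dropout(feature: Vector, rate: float, rng: random.Random) -> Vector:
    """Zero each element independently with probability `rate`."""
    return [0.0 if rng.random() < rate else x for x in feature]


def adagrad_step(S: Matrix, s: Vector, hist_S: Matrix, hist_s: Vector,
                 feature: Vector, label: int, lr: float, lam: float,
                 eps: float = 1e-8) -> None:
    """One AdaGrad ascent step on log p(label | feature) - (lam / 2) * ||S||^2."""
    probs = class_probabilities(feature, S, s)
    for k in range(len(S)):
        error = (1.0 if k == label else 0.0) - probs[k]
        for j, x in enumerate(feature):
            grad = error * x - lam * S[k][j]
            hist_S[k][j] += grad * grad
            S[k][j] += lr * grad / (math.sqrt(hist_S[k][j]) + eps)
        hist_s[k] += error * error
        s[k] += lr * error / (math.sqrt(hist_s[k]) + eps)
```

During training, `dropout(feature, rate, rng)` would be applied to the feature vector before each call to `adagrad_step`; at prediction time `class_probabilities` is called on the unmodified feature vector.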
In what follows, we refer to the above method as RelEmb. While RelEmb uses only low-level features, a variety of useful features have been proposed for relation classification. Among them, we use dependency path features based on the untyped binary dependencies of the Stanford parser to find the shortest path between target nouns. The dependency path features are computed by averaging the word embeddings on the shortest path, and are then concatenated to the feature vector. Furthermore, we directly incorporate semantic information using word-level semantic features from Named Entity (NE) tags and WordNet hypernyms, as used in previous work. We refer to this method as the extended RelEmb. Concretely, the extended RelEmb uses the same binary features as in previous work. The features come from NE tags and WordNet hypernym tags of target nouns provided by a sense tagger.
For pre-training we used a snapshot of the English Wikipedia
4.2 Initialization and Optimization
We initialized the two embedding matrices with zero-mean Gaussian noise. The weight vectors and biases of the logistic regression models were zero-initialized. The model parameters were optimized by maximizing the objective function in Eq. (2) using stochastic gradient ascent. The learning rate was linearly decreased during training, as described in previous work. The hyperparameters are the embedding dimensionality, the context size, the number of negative samples, the initial learning rate, and the number of words outside the noun pairs. For hyperparameter tuning, we first fixed the initial learning rate and the number of words outside the noun pairs, and then varied the embedding dimensionality, the context size, and the number of negative samples.
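A linearly decreasing learning rate schedule can be written in a few lines. The start and end values used in the example call are placeholders chosen for illustration, not the settings of the experiments, and the function name is hypothetical.

```python
def linear_decay(step: int, total_steps: int, initial: float, final: float) -> float:
    """Interpolate linearly from `initial` (step 0) to `final` (last step)."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return initial + (final - initial) * fraction


# Placeholder values: start at 0.025 and decay towards 0.0 over 1000 steps.
rates = [linear_decay(t, 1000, 0.025, 0.0) for t in (0, 500, 1000)]
print(rates)
```

The clamp on `fraction` keeps the rate at its final value if training runs past the planned number of steps.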
At the supervised learning step, we initialized and with zeros. The hyperparameters, the learning rate for AdaGrad, , , and the number of iterations, were determined via 10-fold cross validation on the training set for each setting. Note that can be tuned at the supervised learning step, adapting to a specific dataset.
We evaluated our method on the SemEval 2010 Task 8 data set
(a) Financial [stress]E1 is one of the main causes of [divorce]E2
(b) The [burst]E1 has been caused by water hammer [pressure]E2
Training example (a) is classified as Cause-Effect(E1, E2), which denotes that E2 is an effect caused by E1, while training example (b) is classified as Cause-Effect(E2, E1), which is the inverse of Cause-Effect(E1, E2). We report the official macro-averaged F1 scores and accuracy.
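For readers unfamiliar with the metric, a generic macro-averaged F1 and accuracy can be computed as sketched below. This is a plain textbook version with hypothetical function names; the official scorer additionally handles the Other class and relation directionality in its own way, which this sketch does not reproduce.

```python
from typing import List, Sequence


def macro_f1(gold: Sequence[str], predicted: Sequence[str], labels: Sequence[str]) -> float:
    """Unweighted mean of the per-label F1 scores."""
    scores: List[float] = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of instances whose predicted label matches the gold label."""
    return sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)


gold = ["Cause-Effect(E1,E2)", "Cause-Effect(E1,E2)",
        "Cause-Effect(E2,E1)", "Cause-Effect(E2,E1)"]
pred = ["Cause-Effect(E1,E2)", "Cause-Effect(E2,E1)",
        "Cause-Effect(E2,E1)", "Cause-Effect(E2,E1)"]
labels = ["Cause-Effect(E1,E2)", "Cause-Effect(E2,E1)"]
print(macro_f1(gold, pred, labels), accuracy(gold, pred))
```

Because every label contributes equally to the average, macro-averaged F1 penalizes poor performance on rare relation types more than accuracy does.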
To empirically investigate the performance of our proposed method we compared it to several baselines and previously proposed models.
Random and word2vec Initialization
Rand-Init. The first baseline is RelEmb itself, but without applying the learning method on the unlabeled corpus. In other words, we train the softmax classifier from Section 3.3 on the labeled training data with randomly initialized model parameters.
W2V-Init. The second baseline is RelEmb using word embeddings learned by word2vec. More specifically, we initialize both embedding matrices with the word2vec embeddings. Related to our method, word2vec has a set of weight vectors similar to ours when trained with negative sampling, and we use these weight vectors as a replacement for the weight vectors of our logistic regression models. We trained the word2vec embeddings using the CBOW model with subsampling on the full Wikipedia corpus. As with our experimental settings, we fix the learning rate and investigate several hyperparameter settings. For hyperparameter tuning we vary the embedding dimensionality, the context size, and the number of negative samples.
A simple approach to the relation classification task is to use SVMs with standard binary bag-of-words features. The bag-of-words features included the noun pairs and words between, before, and after the pairs, and we used LIBLINEAR
Neural Network Models
Socher et al. used Recursive Neural Network (RNN) models to classify the relations. Subsequently, Hashimoto et al. and Ebrahimi and Dou proposed RNN models to better handle the relations. These methods rely on syntactic parse trees.
Yu et al. introduced their novel Factor-based Compositional Model (FCM) and presented results from several model variants, the best performing being two variants: one that only uses word embedding information, and one that relies on dependency paths and NE features in addition to word embeddings.
Zeng et al. used a Convolutional Neural Network (CNN) with WordNet hypernyms. Noteworthy in relation to the RNN-based methods, the CNN model does not rely on parse trees. More recently, dos Santos et al. have introduced CR-CNN by extending the CNN model and achieved the best result to date. The key point of CR-CNN is that it improves the classification score by omitting the noisy class “Other” in the dataset described in Section 5.1. In what follows, we distinguish CR-CNN using the “Other” class from CR-CNN omitting the class.
5.3 Results and Discussion
|Model|Features for classifiers|F1 / ACC (%)|
|RelEmb|embeddings, dependency paths, WordNet, NE|83.5 / 79.9|
|RelEmb|embeddings|82.8 / 78.9|
|RelEmb (W2V-Init)|embeddings|81.8 / 77.7|
|RelEmb (Rand-Init)|embeddings|78.2 / 73.5|
|SVM|bag of words|76.5 / 72.0|
|SVM|bag of words, POS, dependency paths, WordNet, paraphrases, TextRunner, Google n-grams, etc.|82.2 / 77.9|
|CR-CNN|embeddings, word position embeddings|84.1 / n/a|
|FCM|embeddings, dependency paths, NE|83.0 / n/a|
|CR-CNN|embeddings, word position embeddings|82.7 / n/a|
|CRNN|embeddings, parse trees, WordNet, NE, POS|82.7 / n/a|
|CNN|embeddings, WordNet|82.7 / n/a|
|MVRNN|embeddings, parse trees, WordNet, NE, POS|82.4 / n/a|
|FCM|embeddings|80.6 / n/a|
|RNN|embeddings, parse trees, phrase categories, etc.|79.4 / n/a|
The scores on the test set for SemEval 2010 Task 8 are shown in Table 1. RelEmb achieves an F1 score of 82.8, which is better than those of almost all models compared and comparable to that of the previous state of the art, except for CR-CNN with the “Other” class omitted. Note that RelEmb does not rely on external semantic features and syntactic parse features.
Comparison with the Baselines
RelEmb significantly outperforms not only the Rand-Init baseline, but also the W2V-Init baseline. These results show that our task-specific word embeddings are more useful than those trained using window-based contexts. A point that we would like to emphasize is that the baselines are unexpectedly strong. As was noted by Wang and Manning, we should carefully implement strong baselines and see whether complex models can outperform these baselines.
Comparison with SVM-Based Systems
RelEmb performs much better than the bag-of-words-based SVM. This is not surprising given that we use a large unannotated corpus and embeddings with a large number of parameters. RelEmb also outperforms the SVM system of Rink and Harabagiu, which demonstrates the effectiveness of our task-specific word embeddings, despite our only requirement being a large unannotated corpus and a POS tagger.
Comparison with Neural Network Models
RelEmb outperforms the RNN models. In our preliminary experiments, we have found some undesirable parse trees when computing vector representations using RNN-based models and such parsing errors might hamper the performance of the RNN models.
The FCM variant that relies on dependency paths and NE features achieves a better score than that of RelEmb. Without such features, RelEmb outperforms the embedding-only FCM variant by a large margin. By incorporating external resources, the extended RelEmb outperforms the FCM variant that uses dependency paths and NE features.
RelEmb compares favorably to CR-CNN using the “Other” class, despite our method being less computationally expensive than CR-CNN. When classifying an instance, the number of floating-point multiplications in our method depends only on the window size, the word embedding dimensionality, and the number of classes, since our method requires only one matrix-vector product for the softmax classifier as described in Section 3.3. In CR-CNN, the number also involves the dimensionality of the convolution layer, the position embedding dimensionality, and the average length of the input sentences. Here, we omit the cost of the hyperbolic tangent function in CR-CNN for simplicity. Using the best hyperparameter settings, and assuming an average sentence length of 10, the number is smaller in our method than in CR-CNN. dos Santos et al. also boosted the score of CR-CNN by omitting the noisy class “Other” by a ranking-based classifier, and achieved the best score. Our results may also be improved by using the same technique, but the technique is dataset-dependent, so we did not incorporate the technique.
5.4 Analysis on Training Settings
We perform analysis of the training procedure focusing on RelEmb.
Effects of Tuning Hyperparameters
In Tables 2 and 3, we show how tuning the hyperparameters of our method and word2vec affects the classification results using 10-fold cross validation on the training set. The same split is used for each setting, so all results are comparable to each other. The best settings for the cross validation are used to produce the results reported in Table 1.
Table 2 shows F1 scores obtained by RelEmb. The results show that RelEmb benefits from relatively large context sizes. The n-gram embeddings in RelEmb capture richer information when the context size is set to 3 rather than 1. Relatively large numbers of negative samples also slightly boost the scores. As opposed to these trends, the score does not improve with a larger embedding dimensionality. We use the best setting for the remaining analysis. We note that RelEmb achieves an F1-score of 82.5.
We also performed similar experiments for the W2V-Init baseline, and the results are shown in Table 3. In this case, the number of negative samples does not affect the scores, and the best score is achieved with a small context size. As discussed in previous work, the small context size captures the syntactic similarity between words rather than the topical similarity. This result indicates that syntactic similarity is more important than topical similarity for this task. Compared to the word2vec embeddings, our embeddings capture not only local context information using word order, but also long-range co-occurrence information by being tailored for the specific task.
As described in Section 3.2, we concatenate three kinds of feature vectors for supervised learning. Table 4 shows classification scores for ablation tests using 10-fold cross validation. We also provide a score using a simplified version of the n-gram feature vector, where the feature vector is computed by averaging the word embeddings of the words between the noun pairs. This feature vector then serves as a bag-of-words feature.
Table 4 clearly shows that the averaged n-gram embeddings contribute the most to the semantic relation classification performance. The difference between the scores of the n-gram feature vector and its simplified bag-of-words version shows the effectiveness of our averaged n-gram embeddings.
Effects of Dropout
At the supervised learning step we use dropout to regularize our model. Without dropout, our performance drops from an F1 score of 82.2 to 81.3 on the training set using 10-fold cross validation.
Performance on a Word Similarity Task
As described in Section 3.1, we have the noun-specific embeddings as well as the standard word embeddings. We evaluated the learned embeddings using a word-level semantic evaluation task called WordSim-353. This dataset consists of 353 pairs of nouns and each pair has an averaged human rating which corresponds to a semantic similarity score. Evaluation is performed by measuring Spearman’s rank correlation between the human ratings and the cosine similarity scores of the embeddings. Table 5 shows the evaluation results. We used the best settings reported in Table 2 and Table 3 since our method is designed for relation classification and it is not clear how to tune the hyperparameters for the word similarity task. As shown in the result table, the noun-specific embeddings perform better than the standard embeddings in our method, which indicates the noun-specific embeddings capture more useful information in measuring the semantic similarity between nouns. The performance of the noun-specific embeddings is roughly the same as that of the word2vec embeddings.
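The evaluation protocol described above (cosine similarity between embeddings, then Spearman's rank correlation against human ratings) can be sketched with the standard library alone. The function names are hypothetical, and the rank correlation is computed as the Pearson correlation of average ranks so that tied values are handled.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = shared
        pos = end + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the two rank lists."""
    rx, ry = ranks(x), ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)
```

In use, `cosine` would be applied to the two embeddings of each noun pair in the dataset, and `spearman` to the resulting list of similarities and the list of human ratings.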
5.5 Qualitative Analysis on the Embeddings
Using the n-gram embeddings in Eq. (3), we inspect which n-grams are relevant to each relation class after the supervised learning step of RelEmb. The context size determines the maximum length of the n-grams we can use. The learned weight matrix in Section 3.3 is used to detect the most relevant n-grams for each class. More specifically, for each n-gram embedding in the training set, we compute the dot product between the n-gram embedding and the corresponding components in the weight matrix. We then select the pairs of n-grams and class labels with the highest scores. In Table 6 we show the top five n-grams for six classes. These results clearly show that the n-gram embeddings capture salient syntactic patterns which are useful for the relation classification task.
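The ranking procedure can be sketched as follows. The n-gram strings, embeddings, and class weights in the example are toy values invented for illustration, and the function name is hypothetical; the class weights are assumed to be the slice of the softmax weight matrix that multiplies the n-gram part of the feature vector.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def top_ngrams(ngrams: Sequence[Tuple[str, Vector]],
               class_weights: Dict[str, Vector],
               top_k: int) -> List[Tuple[float, str, str]]:
    """Score every (n-gram, label) pair by a dot product and keep the best."""
    scored: List[Tuple[float, str, str]] = []
    for text, embedding in ngrams:
        for label, weights in class_weights.items():
            value = sum(a * b for a, b in zip(embedding, weights))
            scored.append((value, text, label))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Toy inputs, not taken from the trained model.
toy_ngrams = [("are caused by", [1.0, 0.0]), ("poured into the", [0.0, 1.0])]
toy_weights = {"Cause-Effect": [2.0, -1.0], "Entity-Destination": [-1.0, 3.0]}
print(top_ngrams(toy_ngrams, toy_weights, 2))
```

Restricting the output to the highest-scoring pairs for one label at a time would give a per-class list of the kind summarized in Table 6.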
6 Conclusions and Future Work
We have presented a method for learning word embeddings specifically designed for relation classification. The word embeddings are trained using large unlabeled corpora to capture lexical features for relation classification. On a well-established semantic relation classification task our method significantly outperforms the baseline based on word2vec. Our method also compares favorably to previous state-of-the-art models that rely on syntactic parsers and external semantic resources, despite our method requiring only access to an unannotated corpus and a POS tagger. For future work, we will investigate how well our method performs on other domains and datasets and how relation labels can help when learning embeddings in a semi-supervised learning setting.
We thank the anonymous reviewers for their helpful comments and suggestions.
- Despite Enju being a syntactic parser we only use the POS tagger component. The accuracy of the POS tagger is about 97.2 on the WSJ corpus.
- The training data, the training code, and the learned model parameters used in this paper are publicly available at http://www.logos.t.u-tokyo.ac.jp/~hassy/publications/conll2015/
- While we use a POS tagger to locate noun pairs, RelEmb does not explicitly use POS features at the supervised learning step.
Mohit Bansal, Kevin Gimpel, and Karen Livescu. Tailoring Continuous Word Representations for Dependency Parsing.
Marco Baroni and Roberto Zamparelli. Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space.
Emanuela Boros, Romaric Besançon, Olivier Ferret, and Brigitte Grau. Event Role Extraction using Domain-Relevant Word Representations.
Razvan Bunescu and Raymond Mooney. A Shortest Path Dependency Kernel for Relation Extraction.
Wenliang Chen, Yue Zhang, and Min Zhang. Feature Embedding for Dependency Parsing.
Massimiliano Ciaramita and Yasemin Altun. Broad-Coverage Sense Disambiguation and Information Extraction with a Supersense Sequence Tagger.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural Language Processing (Almost) from Scratch.
Cicero Nogueira dos Santos, Bing Xiang, and Bowen Zhou. Classifying Relations by Ranking with Convolutional Neural Networks.
John Duchi, Elad Hazan, and Yoram Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.
Javid Ebrahimi and Dejing Dou. Chain Based RNN for Relation Classification.
Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. Placing Search in Context: The Concept Revisited.
Roxana Girju, Preslav Nakov, Vivi Nastase, Stan Szpakowicz, Peter Turney, and Deniz Yuret. SemEval-2007 Task 04: Classification of Semantic Relations between Nominals.
Edward Grefenstette and Mehrnoosh Sadrzadeh. Experimental Support for a Categorical Compositional Distributional Model of Meaning.
Jiang Guo, Wanxiang Che, Haifeng Wang, and Ting Liu. Revisiting Embedding Features for Simple Semi-supervised Learning.
Kazuma Hashimoto, Makoto Miwa, Yoshimasa Tsuruoka, and Takashi Chikayama. Simple Customization of Recursive Neural Networks for Semantic Relation Classification.
Kazuma Hashimoto, Pontus Stenetorp, Makoto Miwa, and Yoshimasa Tsuruoka. Jointly Learning Word Representations and Composition Functions Using Predicate-Argument Structures.
Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations between Pairs of Nominals.
Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors.
Ozan Irsoy and Claire Cardie. Deep Recursive Neural Networks for Compositionality in Language.
Dimitri Kartsaklis and Mehrnoosh Sadrzadeh. Prior Disambiguation of Word Tensors for Constructing Sentence Vectors.
Quoc Le and Tomas Mikolov. Distributed Representations of Sentences and Documents.
Omer Levy and Yoav Goldberg. Neural Word Embedding as Implicit Matrix Factorization.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed Representations of Words and Phrases and their Compositionality.
Yusuke Miyao and Jun’ichi Tsujii. Feature Forest Models for Probabilistic HPSG Parsing.
Andriy Mnih and Koray Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation.
Thien Huu Nguyen and Ralph Grishman. Employing Word Representations and Regularization for Domain Adaptation of Relation Extraction.
Eric W. Noreen. Computer-Intensive Methods for Testing Hypotheses: An Introduction.
Romain Paulus, Richard Socher, and Christopher D Manning. Global Belief Recursive Neural Networks.
Bryan Rink and Sanda Harabagiu. UTD: Classifying Semantic Relations by Combining Lexical and Semantic Resources.
Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. Semantic Compositionality through Recursive Matrix-Vector Spaces.
Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. Learning Sentiment-Specific Word Embedding for Twitter Sentiment Classification.
Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. Word Representations: A Simple and General Method for Semi-Supervised Learning.
Sida Wang and Christopher Manning. Baselines and Bigrams: Simple, Good Sentiment and Topic Classification.
Mo Yu, Matthew R. Gormley, and Mark Dredze. Factor-based Compositional Embedding Models.
Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. Relation Classification via Convolutional Deep Neural Network.
Min Zhang, Jie Zhang, Jian Su, and GuoDong Zhou. A Composite Kernel to Extract Relations between Entities with Both Flat and Structured Features.