# Bag-of-Vector Embeddings of Dependency Graphs for Semantic Induction

## Abstract

Vector-space models, from word embeddings to neural network parsers, have many advantages for NLP. But how to generalise from fixed-length word vectors to a vector space for arbitrary linguistic structures is still unclear. In this paper we propose bag-of-vector embeddings of arbitrary linguistic graphs. A bag-of-vector space is the minimal nonparametric extension of a vector space, allowing the representation to grow with the size of the graph, but not tying the representation to any specific tree or graph structure. We propose efficient training and inference algorithms based on tensor factorisation for embedding arbitrary graphs in a bag-of-vector space. We demonstrate the usefulness of this representation by training bag-of-vector embeddings of dependency graphs and evaluating them on unsupervised semantic induction for the Semantic Textual Similarity and Natural Language Inference tasks.

## 1 Introduction

Word embeddings have made a big contribution to recent advances in NLP. By representing discrete words in a continuous vector space and by learning semantically meaningful vectors from distributions in unannotated text, they are able to capture semantic similarities between words. But generalising this success to models of the meaning of phrases and sentences has proven challenging. If we continue to use fixed-length vectors to encode arbitrarily long sentences, then we inevitably lose information as we scale up to larger semantic structures (e.g. Adi et al. (2016); Blacoe and Lapata (2012); Mitchell and Lapata (2010); Socher et al. (2011); Kiros et al. (2015); Le and Mikolov (2014); Li et al. (2015); Cho et al. (2014); Sutskever et al. (2014)). If we only use vectors to label the nodes of a traditional linguistic structure, then we lose the advantages of encoding discrete structures in a continuous space where similarity between structures can be captured.

In this paper we investigate the minimal extension of a vector space which allows it to embed arbitrarily large structures, namely a bag-of-vector space. These bag-of-vector embeddings (BoVEs) are nonparametric representations, meaning that the size of the representation (i.e. the number of vectors in the bag) can grow with the size of the structure that needs to be embedded. In our case, we assume that the number of vectors is the same as the number of nodes in the structure’s graph. But no other information about the graph has a discrete representation. All properties and relations in the graph are encoded in the continuous values of the vectors.

We propose methods for mapping graphs to BoVE representations and for learning these mappings. To take full advantage of the flexibility of the BoVE representation, we want these mappings to embed arbitrary graphs. For this reason, we propose tensor factorisation algorithms, both for training a BoVE model and for inferring a BoVE given a graph and a model. Like the Word2Vec model of word embeddings Mikolov et al. (2013), tensor factorisation uses a reconstruction loss, where each observed relation is predicted independently conditioned on the latent vector representation. This conditional independence allows tensor factorisation to model arbitrary graphs. As well as stochastic gradient descent, we propose efficient alternating least squares algorithms for optimising this reconstruction loss, both for training models and for inferring BoVEs for a new graph.

As an example of the usefulness of these algorithms, we learn BoVE models for embedding dependency parses of sentences, such as in Figure 1. In these embeddings, each vector corresponds to a word token in the sentence. We can think of these token vectors as context-dependent word embeddings. The BoVE model learns to embed dependency relations by adding to each token vector features that describe the tokens it is related to. So each token vector encodes information about its context in the graph, as well as information about its word.

We evaluate this property by comparing our BoVE embeddings to bags of Word2Vec word embeddings. Initialising a BoVE model with these same word embeddings, we train a model of relations and infer context-dependent vectors for each word token. We then evaluate these two bag-of-vector representations in unsupervised models of two sentence-level semantic tasks, Semantic Textual Similarity (STS) and the Stanford Natural Language Inference (SNLI) dataset. Results show that training a BoVE model does extract semantically meaningful information from the distributions in a corpus of syntactically-parsed text.

In the rest of this paper, we define our tensor factorisation algorithm for learning the parameters of a model, and define our inference algorithm for computing an embedding for a graph given these parameters. We evaluate these algorithms on their ability to extend word embeddings, using two tasks which demonstrate that the resulting bag-of-vector embeddings induce a semantic representation which is more informative than word embeddings.

## 2 Bag-of-Vector Embeddings

Our proposed method for embedding graphs in a bag-of-vector space has two steps: first we learn the parameters of a model from a large corpus of example graphs, then we compute the embedding of a new graph using those parameters. The training phase uses correlations in the data to find good compressions into the limited capacity of each vector. It is the encoding of these correlations in the final embeddings which makes these representations semantically meaningful.

We propose tensor factorisation algorithms for both learning a BoVE model and inferring BoVEs given a model. In tensor factorisation, the graph is encoded as a tensor for relations and a matrix for properties, as illustrated in Figure 1. The relations in the graph are encoded as an entity-by-label-by-entity tensor of indicator variables, where entities are the nodes of the graph and a 1 indicates a relation with the given label between the given entities. All other cells in the tensor are zero. The properties of nodes in the graph are encoded as an entity-by-label matrix of indicator variables, where a 1 indicates a property with the given label for the given entity. As illustrated in Figure 2a, the embedding learns to reconstruct each cell of the matrix as the dot product between the vector for the entity and the vector for the property, and learns to reconstruct each cell of the tensor as the dot product between the vectors for the two entities and a matrix for the relation. In this work we assume squared loss for each of these reconstruction predictions, because it leads to efficient inference procedures.
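As a minimal sketch of this encoding, the following NumPy code builds the indicator matrix and the indicator tensor for a toy two-token graph. The function name `encode_graph`, the integer label indices and the toy sentence are illustrative assumptions, not part of the system described here.

```python
import numpy as np

def encode_graph(n_tokens, properties, relations, n_prop_labels, n_rel_labels):
    """Encode one graph as indicator arrays (hypothetical helper).

    properties: iterable of (token, property_label) index pairs.
    relations:  iterable of (token, relation_label, token) index triples.
    """
    # Entity-by-label matrix of property indicators.
    P = np.zeros((n_tokens, n_prop_labels))
    # Entity-by-label-by-entity tensor of relation indicators.
    X = np.zeros((n_tokens, n_rel_labels, n_tokens))
    for token, label in properties:
        P[token, label] = 1.0
    for first, label, second in relations:
        X[first, label, second] = 1.0
    return P, X

# Toy sentence "dogs bark" (assumed label inventory):
# property labels 0="dogs", 1="bark", 2=NOUN, 3=VERB;
# relation labels 0=subject dependency, 1=string adjacency.
P, X = encode_graph(
    n_tokens=2,
    properties=[(0, 0), (0, 2), (1, 1), (1, 3)],
    relations=[(1, 0, 0), (0, 1, 1)],
    n_prop_labels=4,
    n_rel_labels=2,
)
```

All cells not named in `properties` or `relations` stay zero, matching the indicator encoding described above.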

Training the tensor factorisation model results in vectors for all the properties, plus matrices for all the relations. In our setting the properties are words and part-of-speech (PoS) tags, and the relations are syntactic dependencies and string adjacency. We refer to these as the type embeddings. At test time, given an input graph, we freeze these type embeddings and infer one vector for each node in the graph. In our setting the nodes are tokens of words in the sentence. These are the token vectors. The bag of token vectors is the embedding of the graph.

### 2.1 The BoVE Model

Our use of tensor factorisation is closely related to the RESCAL model Nickel et al. (2011, 2012). RESCAL is designed as a model of large relational databases, with large numbers of entities, properties of entities and (binary) relations between entities. Unlike in the RESCAL setting, the entities in our datasets are partitioned into sentences. No relations are possible between tokens in different sentences, so representing a corpus as one big database where anything can be related to anything is not appropriate. Also unlike in RESCAL, our target use case is transductive, not abductive; we want to train the parameters of a model which can then be applied to compute the embedding of a previously unseen parsed sentence. And we want the inference necessary at test time to be fast, which excludes the possibility of re-factorising the training set.

Given a set of sentence graphs , we define the tensor dimension sizes , , and as in Figure 2b. The data is encoded in indicator tensors , also specified in Figure 2b. For each sentence graph , there is a 2-dimensional tensor that indicates which unary predicates label which tokens, and a 3-dimensional tensor that indicates which binary relations exist between two tokens. We want to factorise these tensors into the real-valued embedding tensors for unary predicates, for binary relations, and for the tokens in each sentence , also defined in Figure 2b.

This tensor factorisation is depicted in Figure 2a. The objective of the factorisation is to be able to reconstruct and from , and , for all :

(1) | ||||

(2) |
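One way to write the two reconstruction targets is sketched below. The symbol names are assumptions chosen for illustration: for a sentence $s$ with $n_s$ tokens and embedding size $r$, let $A_s \in \mathbb{R}^{n_s \times r}$ hold the token vectors as rows, let $V$ hold the property-label vectors as rows, let $R_k \in \mathbb{R}^{r \times r}$ be the matrix for relation label $k$, let $P_s$ be the property indicator matrix and let $X_{s,k}$ be the slice of the relation indicator tensor for label $k$.

$$
P_s \approx A_s V^{\top}, \qquad X_{s,k} \approx A_s R_k A_s^{\top} \quad \text{for every relation label } k .
$$

In this form, each cell of $P_s$ is approximated by the dot product of a token vector and a property vector, and each cell of $X_{s,k}$ by a bilinear product of two token vectors with the relation matrix, as described in Section 2.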

We use quadratic loss for efficiency reasons. Adding quadratic ($L_2$) regularisation, we get the objective function:

(3) | ||||

where is a hyperparameter determining the relative importance of embedding relations versus embedding properties, and , and are hyperparameters that determine the strength of regularisation.
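A regularised squared reconstruction loss of this kind can be sketched for a single sentence graph as follows. All names (`bove_loss`, `alpha`, `lam_A`, `lam_V`, `lam_R`) are hypothetical, and placing the weight `alpha` on the relation term is an assumption; the text only states that one hyperparameter balances relations against properties. The relation tensor is stored relation-first here, with shape `(K, n, n)`, for convenience.

```python
import numpy as np

def bove_loss(P, X, A, V, R, alpha=1.0, lam_A=0.0, lam_V=0.0, lam_R=0.0):
    """Squared reconstruction loss with quadratic regularisation (a sketch).

    P: (n, p) property indicators     X: (K, n, n) relation indicators
    A: (n, r) token vectors           V: (p, r) property vectors
    R: (K, r, r) relation matrices
    """
    prop_err = P - A @ V.T            # (n, p) property residuals
    rel_err = X - A @ R @ A.T         # (K, n, n); matmul broadcasts over K
    loss = np.sum(prop_err ** 2) + alpha * np.sum(rel_err ** 2)
    # Quadratic penalties; in a corpus-level objective the penalties on V and R
    # would be counted once, not once per sentence.
    loss += lam_A * np.sum(A ** 2) + lam_V * np.sum(V ** 2) + lam_R * np.sum(R ** 2)
    return float(loss)
```

With all regularisation strengths set to zero, the loss is zero exactly when both the property matrix and every relation slice are reconstructed perfectly.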

In addition to $L_2$ regularisation, we have also tried $L_1$ regularisation and nuclear-norm regularisation for and . To do nuclear-norm regularisation on , we run SVD on each relation’s slice of , apply $L_1$ regularisation on the resulting singular values, and then reconstruct the slice of with the reduced singular values. This in effect regularises the rank of the matrix for each relation, which is particularly motivated because predicting labelled relations is relatively easy. There are a very large number of words, so the full rank of the model is needed to predict words. But most relations are much easier to predict, and thus need a much smaller rank.
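A common way to realise such a step is soft-thresholding of the singular values, sketched below for one relation matrix. Whether the experiments used exactly this shrinkage rule is an assumption; the function name and the `threshold` parameter are hypothetical.

```python
import numpy as np

def shrink_singular_values(R_k, threshold):
    """Soft-threshold the singular values of one relation matrix (a sketch)."""
    U, s, Vt = np.linalg.svd(R_k, full_matrices=False)
    # Shrink every singular value towards zero and clip at zero; values below
    # the threshold vanish, which lowers the rank of the result.
    s_shrunk = np.maximum(s - threshold, 0.0)
    # Rebuild the matrix from the reduced singular values.
    return (U * s_shrunk) @ Vt
```

For example, applying a threshold of 0.5 to a diagonal matrix with entries 3, 1 and 0.2 removes the smallest direction entirely and leaves a rank-2 matrix.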

Training our model optimises the above objective to find values for , and values for for all training graphs in the training corpus . At testing time we are given a new graph and we optimise a new while keeping fixed. The rows of the matrix are the vector representations of the entities in . The rows of are exchangeable, which is why we refer to them as a bag-of-vectors.

### 2.2 Training the BoVE Model

Given the objective in equation 3, it is straightforward to define a stochastic gradient descent (SGD) algorithm for training a BoVE model. In this section, we focus on an alternating least squares (ALS) optimisation algorithm for training BoVE models. Provided that the size of the vectors is not too large, this algorithm is much faster than SGD.

This ALS algorithm is inspired by the RESCAL algorithm Nickel et al. (2011, 2012). Like the RESCAL algorithm, we cycle between updating and given , and updating given and . For and , there is a closed form solution to find the exact minimum of the objective. To find the optimal given and , we need to iteratively compute the least-squares solution for a new value of given the old value of . We do a few steps of this iteration for each update of and , as discussed below. First we give details of the closed form solution for updating and .

To update , we construct one matrix whose rows are the concatenation () of the rows from the matrices for all the sentences .

So has rows and as many columns as there are tokens in the training corpus. Similarly, we construct a matrix , which has columns and as many rows as tokens in the corpus.

We then find the optimal to minimise the regularised quadratic loss for

Using standard linear algebra, this gives us:

Note that this requires computing the inverse of an matrix.
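The closed-form update for the property vectors amounts to a ridge regression over all tokens in the corpus, which can be sketched as follows. The names are hypothetical, and the arrays are stored with tokens as rows, i.e. the transpose of the layout described above for the concatenated token matrix.

```python
import numpy as np

def update_property_embeddings(A_list, P_list, lam=0.1):
    """Ridge-regression update of the property vectors (a sketch).

    A_list: per-sentence token matrices, each of shape (n_s, r).
    P_list: per-sentence property indicator matrices, each of shape (n_s, p).
    Returns V of shape (p, r), one row per property label.
    """
    A = np.vstack(A_list)                 # (N, r): all tokens in the corpus
    P = np.vstack(P_list)                 # (N, p)
    r = A.shape[1]
    gram = A.T @ A + lam * np.eye(r)      # the only r-by-r system to solve
    # Solving the linear system avoids forming the inverse explicitly.
    return np.linalg.solve(gram, A.T @ P).T
```

The cost of the solve depends only on the embedding size, not on the number of tokens, which is what makes this update cheap for moderate ranks.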

To update , we first represent as a matrix by listing each slice of the matrix in a row vector by concatenating its rows ().

So has rows and columns. Secondly, we map each sentence’s tensor into a matrix by enumerating pairs of tokens for each column of , so each matrix becomes row of .

Thirdly, we map each sentence’s embedding matrix into a larger matrix with columns and a row for every pair of tokens in the sentence. If is the pair, then the row of is the vectorisation of the outer product between the embedding of and the embedding of . We can write this formally using the Kronecker product :

We then concatenate the matrices for the different sentences as done above for updating , to get and .

Based on the equation , finding the optimal can then be done by finding the optimal to minimise the regularised quadratic loss for

Which gives us:

Note that this least-squares problem requires finding the inverse of an matrix, which becomes very expensive as becomes large. Our experiments so far have been done with ranks where this is not the limiting factor ().
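The relation update can be sketched in the same way, using Kronecker products of token vectors as regression features. Names and array layouts are assumptions: relation tensors are stored relation-first with shape `(K, n_s, n_s)`, and each relation matrix is flattened row by row.

```python
import numpy as np

def update_relation_matrices(A_list, X_list, lam=0.1):
    """Ridge-regression update of all relation matrices (a sketch).

    A_list: per-sentence token matrices, each of shape (n_s, r).
    X_list: per-sentence relation indicators, each of shape (K, n_s, n_s).
    Returns R of shape (K, r, r).
    """
    r = A_list[0].shape[1]
    K = X_list[0].shape[0]
    # Row (i, j) of np.kron(A, A) is the flattened outer product of token
    # vectors i and j, so its dot product with a flattened R_k is a_i^T R_k a_j.
    Z = np.vstack([np.kron(A, A) for A in A_list])          # (pairs, r*r)
    Y = np.vstack([X.reshape(K, -1).T for X in X_list])     # (pairs, K)
    gram = Z.T @ Z + lam * np.eye(r * r)                    # (r*r, r*r) system
    R_flat = np.linalg.solve(gram, Z.T @ Y).T               # (K, r*r)
    return R_flat.reshape(K, r, r)
```

The linear system here has one row and column per pair of embedding dimensions, which is why the cost grows quickly with the embedding size.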

To update , we take advantage of the fact that the tokens in different sentences cannot be in relations with each other. This allows us to optimise the embeddings for all sentences independently of each other. Updating each is an instance of the same problem as updating the entity embeddings in RESCAL, only for a smaller relation tensor and property matrix. We use a modified version of the update procedure of RESCAL, applying it once for each sentence. The RESCAL procedure updates using the old version of for one side of the relation tensor. This gives three sets of equations which need to be optimised, one for each occurrence of in Equations 1 and 2.

where is the old version and is the new matrix we want to optimise. Adding our weighting to the equations for the RESCAL procedure we get the following solution to the regularised least squares objective:

(4) | ||||

where

and concatenates the rows of , and .

However, we found that with this procedure, the new values tend to overcompensate for the errors in the old values , resulting in oscillation. This may be because, in our setting, the vectors for are much bigger than they need to be to represent ; most of that capacity is only needed for representing . We addressed this problem by running two consecutive iterations of this optimisation, keeping and fixed, to get and . As our new value of , we used the average of and :

(5) |

This average (in ) is a better approximation to the combined effect (in ) of the and matrices than using the matrix alone (as in RESCAL). This average can be seen as a two-step process which first infers and separately and then projects into a sub-space where they are equal. We run one such average-of-two update of all in in between each update of and .
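The averaged update can be sketched as follows for a single sentence. The closed form in `als_step` is derived here from the three sets of least-squares equations mentioned above, with the other occurrence of the token matrix held at its old value; it is a reconstruction under that assumption, and all names (`als_step`, `averaged_update`, `alpha`, `lam`) are hypothetical.

```python
import numpy as np

def als_step(A_old, P, X, V, R, alpha=1.0, lam=0.1):
    """One least-squares solve for the token vectors (a sketch).

    P: (n, p), X: (K, n, n), V: (p, r), R: (K, r, r), A_old: (n, r).
    """
    r = V.shape[1]
    gram_old = A_old.T @ A_old
    numer = P @ V                                   # property equations
    denom = V.T @ V + lam * np.eye(r)
    for X_k, R_k in zip(X, R):
        # Token as first argument of the relation, then as second argument.
        numer = numer + alpha * (X_k @ A_old @ R_k.T + X_k.T @ A_old @ R_k)
        denom = denom + alpha * (R_k @ gram_old @ R_k.T + R_k.T @ gram_old @ R_k)
    # denom is symmetric, so this computes numer @ inv(denom).
    return np.linalg.solve(denom, numer.T).T

def averaged_update(A_old, P, X, V, R, alpha=1.0, lam=0.1):
    """Two consecutive solves followed by their average, to damp oscillation."""
    A_first = als_step(A_old, P, X, V, R, alpha, lam)
    A_second = als_step(A_first, P, X, V, R, alpha, lam)
    return 0.5 * (A_first + A_second)
```

If the data are reconstructed exactly by some token matrix and no regularisation is applied, that matrix is a fixed point of both functions, which gives a simple sanity check for an implementation.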

Overall, our objective in this alternating least squares training is only to find good values of and . As discussed in the next subsection, at test time we are given a new sentence’s and , and we want to compute for this new sentence, keeping and fixed to the values learned during training. To make the training setup as similar as possible to the testing setup, in both cases we initialise the model with zero values for . In training, we initialise with random values and with zeros. We also periodically re-initialise to zeros and run several iterations of updating (as in testing) before returning to the alternating least squares optimisation described above. We stop training when the squared loss stops improving by more than 0.1% at each iteration.

### 2.3 Inference of a BoVE for a Graph

Because we are assuming a transductive learning setting, at test time we are given a new sentence and its and , and we want to compute for this new sentence, keeping and fixed to the values learned during training. The objective function for this inference of remains the same as in training (equation 3), so again this optimisation can be done with either SGD or ALS. But because there is no need to optimise , there is no need to compute the inverse of an matrix, making ALS faster even for larger embedding sizes .

We initialise to zero values, and run several iterations of the ALS optimisation procedure described in section 2.2, using equations 4 and 5. During the first update, is not affected by and , so each entity only receives features from its properties in .

In the second update, these features are combined with features propagated from the entity’s immediately related entities. With each update, features from farther away in the graph have an impact on the entity’s embedding. After the first update, we apply the averaging procedure described in section 2.2 to every update, applying equation 5 with the previous average as and the new result of equation 4 as .

While the number of iterations performed during testing could be determined in a number of ways, in our experiments we simply use a fixed number of iterations (30).
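The inference loop described in this section can be sketched as below: zero initialisation, a first plain update, then averaged updates for a fixed number of iterations. The least-squares solve inside `solve_tokens` is a reconstruction under the assumption that the other occurrence of the token matrix is held at its previous value; the function and parameter names are hypothetical.

```python
import numpy as np

def solve_tokens(A_prev, P, X, V, R, alpha, lam):
    """Regularised least-squares solve for the token vectors given A_prev."""
    gram = A_prev.T @ A_prev
    numer = P @ V + alpha * sum(
        X_k @ A_prev @ R_k.T + X_k.T @ A_prev @ R_k for X_k, R_k in zip(X, R))
    denom = V.T @ V + lam * np.eye(V.shape[1]) + alpha * sum(
        R_k @ gram @ R_k.T + R_k.T @ gram @ R_k for R_k in R)
    return np.linalg.solve(denom, numer.T).T      # denom is symmetric

def infer_bove(P, X, V, R, alpha=1.0, lam=0.1, n_iters=30):
    """Infer token vectors for a new graph with V and R frozen (a sketch)."""
    A = np.zeros((P.shape[0], V.shape[1]))
    for it in range(n_iters):
        A_new = solve_tokens(A, P, X, V, R, alpha, lam)
        # First update: only property features arrive, since A is all zeros.
        # Later updates: average the previous value with the new solution.
        A = A_new if it == 0 else 0.5 * (A + A_new)
    return A
```

Because the token matrix starts at zero, the relation terms vanish in the first solve, so the first iterate depends only on the property matrix, as stated above.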

## 3 Related Work

As outlined above, the proposed model is closely related to RESCAL, which was developed for learning embeddings for entities in a large relational database. Our model differs in that it learns from many small graphs, rather than one big one, and it is targeted at computing embeddings for new entities not in the training set. Riedel et al. (2013) combine entities from a knowledge base with entities in text and jointly factorise them. But they do not use tensor factorisation methods, and the above two differences also apply.

Neural network (NN) models have been used to compute embeddings for parsed sentences. They either use a fixed-length vector for an arbitrarily long sentence (e.g. Socher et al. (2011)), or they keep the original sentence structure and decorate it with vectors (e.g. Henderson (2003); Socher et al. (2013)). Ours is the first non-parametric vector-based model that does not keep the entire structure. However, attention-based NN models could be used in this way. For example in machine translation, Bahdanau et al. (2014) take a source sequence of words and encode it as a sequence of vectors, which are then used in an attention-based NN model to generate the target sequence of words for the translation. Keeping the original ordering of words is not fundamental to this method, and thus it could be interpreted as a bag-of-vector embedding method for sequences. However, it is not at all clear how to generalise such a sequence-encoding method to arbitrary graphs. Similarly, Kalman filters have been used to induce vector representations of word tokens in their sequential context Belanger and Kakade (2015), but it is not clear how to generalise this to arbitrary graphs.

## 4 Empirical Evaluation

As an example of the usefulness of the proposed algorithms and BoVE representations, we evaluate them for unsupervised semantic induction. We train BoVE models for embedding dependency parses of sentences, and use the resulting BoVEs in unsupervised models of Semantic Textual Similarity (STS) Agirre et al. (2014, 2015) and the Stanford Natural Language Inference (SNLI) dataset Bowman et al. (2015). Training is done on a standard treebank, without looking at the data for the task. Then the sentences for the task data are parsed and the BoVEs for these parses are inferred. Then the semantic relationship between two sentences is predicted using an alignment between the elements in the two BoVEs.

In BoVEs of dependency parses, each vector corresponds to a word token in the sentence, and encodes both the features of that word and its context. As a strong baseline, we use Word2Vec embeddings as representations that just encode features of the word, without its context in the sentence. This baseline is a good representative of the state-of-the-art in unsupervised semantic induction. Thus, we compare two bag-of-vector representations, one a bag of word-embeddings, and another a bag-of-vector embedding of the parsed sentence.

To provide a direct comparison to this word-embeddings baseline, we initialise the word type embeddings in our BoVE model to Word2Vec word embeddings. This also has the advantage that the model can leverage the fact that these embeddings have been trained on a very large corpus. The word type vectors are then frozen, and the other type embeddings (PoS vectors, dependency matrices and an adjacency matrix) are trained to optimise the regularised reconstruction loss in equation 3, using the CoNLL 2009 syntactic dependency corpus. These trained parameters are then also frozen and used to infer BoVEs of the test sentences.

Given two bag-of-vector representations for two sentences, we use an alignment-based model to predict either similarity (for STS) or entailment (for SNLI) between the two sentences. In both cases, the score for the pair is the score of the best alignment between the pair of bags, and the evaluation measure is a function of the ranked list of these scores.

### 4.1 Experimental Setup

#### Training corpus

As the training corpus, we use the CoNLL 2009 Hajič et al. (2009) syntactic dependency parses, derived from the Penn Treebank (PTB) collection of parsed and PoS-tagged Wall Street Journal texts. It consists of 40k parsed sentences, 69 unique syntactic dependencies, 20k unique word types and 1 million word tokens, with an average of 25 word tokens per sentence. We also add 1 adjacency relation to the parse graph. We impose a frequency threshold of 2 for word types and PoS tags and 1000 for syntactic dependencies. Word types that appear with frequency lower than the threshold are replaced by an ‘UNKNOWN_POS’ tag, where POS stands for the part-of-speech tag associated with the word. Similarly, all infrequent PoS tags are replaced by the tag ‘UNKNOWN_POSTAG’ and all infrequent relations are replaced by an ‘UNKNOWN_RELATION’ tag. We use a generic NB tag to replace all numbers and a PUNCT tag for all punctuation marks. For each sentence, we populate a matrix and a tensor of indicator variables with the corresponding information.
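The word-type thresholding can be sketched as follows. The function name and the input format (lists of word and PoS-tag pairs) are assumptions made for illustration, and the handling of numbers, punctuation, rare PoS tags and rare relations is left out.

```python
from collections import Counter

def replace_rare_words(sentences, min_count=2):
    """Replace word types below the frequency threshold (a sketch).

    sentences: list of sentences, each a list of (word, pos_tag) pairs.
    Rare words become "UNKNOWN_<pos_tag>", e.g. "UNKNOWN_NN".
    """
    counts = Counter(word for sent in sentences for word, _ in sent)
    return [
        [(word if counts[word] >= min_count else f"UNKNOWN_{pos}", pos)
         for word, pos in sent]
        for sent in sentences
    ]

corpus = [
    [("dogs", "NNS"), ("bark", "VBP")],
    [("dogs", "NNS"), ("sleep", "VBP")],
]
cleaned = replace_rare_words(corpus, min_count=2)
```

In this toy corpus only "dogs" reaches the threshold, so both verbs are mapped to the same unknown-word type for their part of speech.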

#### Training the BoVE model

At training time, we fix the word type embeddings to their corresponding pre-trained GoogleNews Word2Vec values whenever available.^{1}

#### Inferring the BoVE of a sentence

At inference time, we obtain a BoVE for each sentence, given fixed word type and relation embeddings learned during the training phase. All sentences are tokenised, PoS tagged and parsed and then token embeddings are inferred using the alternating least squares (ALS) method from Sections 2.2 and 2.3.

### 4.2 Scores for Pairs of BoVEs

Both STS and SNLI are tasks where we need to score pairs of sentences. After inferring a BoVE for each of the sentences, we predict the semantic relationship between the sentences by computing a score based on an alignment between the elements in the two BoVEs.

For SNLI, we want the score to reflect how well a sentence $s_2$ is entailed by a sentence $s_1$, so we use an asymmetric alignment where all the words in $s_2$ are aligned to some word in $s_1$:

$$\sum_{v \in s_2} \max_{u \in s_1} \cos(u, v)$$

where $u$ and $v$ range over the token vectors of the two sentences and $\cos(u, v)$ is the cosine between the two vectors. This score has the property that it grows with the length of $s_2$, since each component of the sum is positive. In the SNLI data, we observe a negative correlation between the entailed sentence size and the entailment score, indicating that a pair of sentences is more likely to be considered in an entailment relation if the second sentence is shorter. This correlation seems to be an idiosyncrasy of the dataset, and because we are only considering unsupervised models, modelling this correlation should not be allowed. For this reason, we only consider scores which are independent of sentence length. To make the cosine score independent of sentence length, we divide the entailment score between the 2 sentences by the length of the entailed sentence and use this as our scoring function for evaluations.

$$\mathrm{ent}(s_1, s_2) = \frac{1}{|s_2|} \sum_{v \in s_2} \max_{u \in s_1} \cos(u, v) \qquad (6)$$

This means that, for each word of the entailed sentence, we find a word in the entailing sentence that best entails it, and then average over these alignment scores.^{2}
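The length-normalised alignment score described above can be sketched in NumPy as follows. The function names are hypothetical, each bag is assumed to be an array with one non-zero token vector per row, and no special handling of zero vectors is included.

```python
import numpy as np

def cosine_matrix(bag_a, bag_b):
    """Pairwise cosines between the rows of two bags of vectors."""
    a = bag_a / np.linalg.norm(bag_a, axis=1, keepdims=True)
    b = bag_b / np.linalg.norm(bag_b, axis=1, keepdims=True)
    return a @ b.T

def entailment_score(entailing, entailed):
    """Average, over entailed tokens, of the best cosine with an entailing token."""
    sims = cosine_matrix(entailed, entailing)    # (n_entailed, n_entailing)
    # Each row's maximum is independent of the others, so the per-token best
    # matches also form the best overall alignment.
    return float(sims.max(axis=1).mean())
```

Swapping the two arguments generally changes the result, which is what makes the score asymmetric.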

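For STS, we want the score to reflect the similarity between $s_1$ and $s_2$, which is a symmetric relation. As is typically done with alignment-based measures, to get a symmetric score, we compute the asymmetric score in equation 6 in both directions, and then use the harmonic mean between these two scores. Writing $\mathrm{ent}(\cdot, \cdot)$ for the asymmetric score of equation 6:

$$\mathrm{sim}(s_1, s_2) = \frac{2\,\mathrm{ent}(s_1, s_2)\,\mathrm{ent}(s_2, s_1)}{\mathrm{ent}(s_1, s_2) + \mathrm{ent}(s_2, s_1)} \qquad (7)$$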
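The symmetric combination can be sketched as a small helper that takes the two directional scores as inputs. The function name is hypothetical, and returning zero when the two scores do not sum to a positive value is an assumption added to avoid division by zero.

```python
def harmonic_mean_score(score_ab, score_ba):
    """Harmonic mean of the two directional alignment scores (a sketch)."""
    total = score_ab + score_ba
    if total <= 0.0:
        # Assumed fallback: the harmonic mean is undefined in this case.
        return 0.0
    return 2.0 * score_ab * score_ba / total
```

The harmonic mean is dominated by the smaller of the two scores, so a pair is rated as similar only when each sentence is well covered by the other.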

### 4.3 Semantic Textual Similarity

Semantic Textual Similarity (STS) Agirre et al. (2014, 2015) is a shared task for systems designed to measure the degree of semantic equivalence of pairs of texts. Most submissions to the STS tasks use supervised models that are trained and tuned on the provided training data or on similar datasets from earlier versions of the task, and many use additional knowledge resources (e.g. Sultan et al. (2015)). We use this data in a fully unsupervised setting, where our only external resources are corpora of raw or parsed text. We use the STS-2014 data as a development set to evaluate BoVEs trained with a few different hyperparameter settings. The best results are reported in the top half of Table 1. The STS-2014 dataset consists of 6 subsets covering different domains, which we report separately, along with their average score. We then evaluated this one best model on the STS-2015 data. Results are shown in the bottom half of Table 1. The STS-2015 dataset consists of 5 subsets covering different domains: answers-forums (Q&A in public forums), answers-students, belief, headlines (news headlines) and images (image captions). We report the standard evaluation measure provided within the SemEval-2015 Task 2 Agirre et al. (2015) based on the mean Pearson correlation between the gold scores and the predicted scores.

| Corpus | pairs | Bag of W2V | BoVE |
|---|---|---|---|
| deft-forum 2014 | 450 | 0.4334 | 0.4386 |
| deft-news 2014 | 300 | 0.6583 | 0.6795 |
| images 2014 | 750 | 0.7398 | 0.7410 |
| headlines 2014 | 750 | 0.6166 | 0.6301 |
| OnWN 2014 | 750 | 0.6857 | 0.6800 |
| tweet-news 2014 | 750 | 0.7235 | 0.7187 |
| 2014 mean | | 0.6429 | 0.6480 |
| answers-forums | 375 | 0.6133 | 0.6174 |
| answers-students | 750 | 0.7123 | 0.7145 |
| belief | 375 | 0.7384 | 0.7295 |
| headlines | 750 | 0.6904 | 0.7033 |
| images | 750 | 0.8008 | 0.8073 |
| 2015 mean | | 0.7110 | 0.7144 |

As can be seen in Table 1, the BoVE model shows an improvement over the Word2Vec model in four out of six datasets for STS-2014 and in four out of five datasets for STS-2015, and in both cases in the average across datasets. These differences are not statistically significant.
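A per-subset Pearson evaluation of this kind can be sketched as follows. The function name and the input format are assumptions. The mean rows in Table 1 agree with an unweighted average of the subset correlations, so the sketch averages without weighting by subset size.

```python
import numpy as np

def mean_pearson(subsets):
    """Pearson correlation per subset and their unweighted mean (a sketch).

    subsets: dict mapping a subset name to a (gold_scores, predicted_scores) pair.
    """
    per_subset = {
        name: float(np.corrcoef(gold, predicted)[0, 1])
        for name, (gold, predicted) in subsets.items()
    }
    return per_subset, float(np.mean(list(per_subset.values())))
```

Pearson correlation is invariant to positive rescaling and shifting of the predicted scores, so alignment scores in the cosine range can be compared directly against gold similarity ratings on a different scale.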

### 4.4 Natural Language Inference

The Stanford Natural Language Inference (SNLI) corpus Bowman et al. (2015) is a benchmark for the evaluation of systems on the task of textual entailment. It consists of 570k human-written English sentence pairs labelled as entailment, contradiction or neutral and divided into pre-defined train, development and test sets. While most approaches using SNLI keep the 3-class structure, we focus on detecting whether two sentences are in an entailment relation or not and thus combine the ’neutral’ and ’contradiction’ labels under one single ’non-entailment’ label.

For these evaluations, we used the same BoVE model which gave the best results on the development set for STS, and ran evaluations on both the SNLI development set and the SNLI test set. We infer the BoVEs for the sentences in the SNLI data using this model, and then each pair of sentences is assigned a score using equation 6. These scores are ranked, and we report results in terms of average precision of these rankings. The results are shown in Table 2.

| Corpus | pairs | Bag of W2V | BoVE |
|---|---|---|---|
| SNLI-dev | 10000 | 64.47% | 65.74% |
| SNLI-test | 10000 | 63.04% | 64.01% |

As can be seen in Table 2, the BoVE model shows an improvement over the Word2Vec model on both datasets. These differences are statistically significant.
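Average precision over a ranked list of scored sentence pairs can be sketched as below, with entailment as the positive class. The function name is hypothetical, and returning zero when there are no positive pairs is an assumption.

```python
import numpy as np

def average_precision(scores, labels):
    """Average precision of a ranking by descending score (a sketch).

    scores: one score per sentence pair.
    labels: 1 for entailment, 0 for non-entailment.
    """
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    ranked = np.asarray(labels, dtype=float)[order]
    n_positive = ranked.sum()
    if n_positive == 0:
        return 0.0                                  # assumed convention
    hits = np.cumsum(ranked)
    precision_at_k = hits / np.arange(1, len(ranked) + 1)
    # Average the precision values at the ranks of the positive pairs.
    return float((precision_at_k * ranked).sum() / n_positive)
```

Since the measure depends only on the order of the scores, it needs no decision threshold, which suits an unsupervised scoring function.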

## 5 Conclusions

This paper proposes methods for training and inferring bag-of-vector embeddings of linguistic graphs. The above empirical results indicate that these BoVE models succeed in inducing semantic information from a corpus of parsed text. In particular, the way the BoVE model embeds information about the syntactic context of a word token results in a better measure of semantic similarity than using word embeddings, as reflected in better unsupervised models of Semantic Textual Similarity and the Stanford Natural Language Inference dataset.

In addition, several theoretical properties motivate the proposed algorithms for learning a model of embedding graphs in a bag-of-vectors and for inferring the bag-of-vector embedding of a graph given such a model. The use of bag-of-vector spaces as the representation eliminates the need to maintain discrete relationships as part of the representation, but still allows the embedding of arbitrarily large graphs in arbitrarily large (nonparametric) representations. The use of reconstruction loss as the objective allows these methods to be applied to arbitrary graphs. The alternating-least-squares algorithms scale well to large datasets and make inference at test time efficient.

Future work includes the use of the trained BoVE representations in supervised semantic tasks. In this context, BoVEs are a natural match with attention shifting neural network models, where their content-based access to vectors in the bag eliminates the need for other data structures, such as a stack or tape. This approach should allow many NLP tasks which traditionally rely on discrete structured representations to take advantage of the continuous space of similarities provided by bag-of-vector embeddings.

### Footnotes

1. https://code.google.com/archive/p/word2vec/
2. Note that these maximums over individual alignments are also global maximums, since the individual alignments are independent.

### References

- Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2016. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. arXiv preprint arXiv:1608.04207.
- Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2014. SemEval-2014 task 10: Multilingual semantic textual similarity. In Proceedings of the 8th international workshop on semantic evaluation (SemEval 2014). Association for Computational Linguistics Dublin, Ireland, pages 81–91.
- Eneko Agirre, Carmen Banea, Claire Cardie, Daniel M. Cer, Mona T. Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Iñigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, and Janyce Wiebe. 2015. SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In SemEval@NAACL-HLT.
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In ICLR 2015.
- David Belanger and Sham Kakade. 2015. A linear dynamical system model for text. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37. JMLR.org, ICML’15, pages 833–842. http://dl.acm.org/citation.cfm?id=3045118.3045208.
- William Blacoe and Mirella Lapata. 2012. A comparison of vector-based representations for semantic composition. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, Stroudsburg, PA, USA, EMNLP-CoNLL ’12, pages 546–556. http://dl.acm.org/citation.cfm?id=2390948.2391011.
- Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.
- Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR abs/1406.1078. http://arxiv.org/abs/1406.1078.
- John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12:2121–2159. http://dl.acm.org/citation.cfm?id=1953048.2021068.
- Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task. Association for Computational Linguistics, Boulder, Colorado, pages 1–18. http://www.aclweb.org/anthology/W/W09/W09-1201.
- James Henderson. 2003. Inducing history representations for broad coverage statistical parsing. In Proc. joint meeting of North American Chapter of the Association for Computational Linguistics and the Human Language Technology Conf. Edmonton, Canada, pages 103–110.
- Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. Skip-thought vectors. CoRR abs/1506.06726. http://arxiv.org/abs/1506.06726.
- Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14). pages 1188–1196.
- Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. CoRR abs/1506.01057. http://arxiv.org/abs/1506.01057.
- Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2015. Learning context-sensitive word embeddings with neural tensor skip-gram model. In Proceedings of the 24th International Conference on Artificial Intelligence. AAAI Press, IJCAI’15, pages 1284–1290. http://dl.acm.org/citation.cfm?id=2832415.2832428.
- Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, Curran Associates, Inc., pages 3111–3119. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf.
- Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science 34(8):1388–1439.
- Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2015. Efficient non-parametric estimation of multiple embeddings per word in vector space. CoRR abs/1504.06654. http://arxiv.org/abs/1504.06654.
- Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), ACM, pages 809–816.
- Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2012. Factorizing YAGO: scalable machine learning for linked data. In Proceedings of the 21st International Conference on World Wide Web. ACM, pages 271–280.
- Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In Proceedings of NAACL-HLT 2013. pages 74–84.
- Richard Socher, John Bauer, Christopher D Manning, and Andrew Y Ng. 2013. Parsing with compositional vector grammars. In ACL (1). pages 455–465.
- Richard Socher, Eric H. Huang, Jeffrey Pennington, Andrew Y. Ng, and Christopher D. Manning. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Proceedings of the 24th International Conference on Neural Information Processing Systems. Curran Associates Inc., USA, NIPS’11, pages 801–809. http://dl.acm.org/citation.cfm?id=2986459.2986549.
- Md Arafat Sultan, Steven Bethard, and Tamara Sumner. 2015. DLS@CU: Sentence similarity from word alignment and semantic vector composition. In SemEval. pages 148–153.
- Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. CoRR abs/1409.3215. http://arxiv.org/abs/1409.3215.