Deriving a Representative Vector for Ontology Classes with Instance Word Vector Embeddings
Selecting a representative vector for a set of vectors is a very common requirement in many algorithmic tasks. Traditionally, the mean or median vector is selected. Ontology classes are sets of homogeneous instance objects that can be converted to a vector space by word vector embeddings. This study proposes a methodology to derive a representative vector for ontology classes whose instances were converted to the vector space. We start by deriving five candidate vectors which are then used to train a machine learning model that would calculate a representative vector for the class. We show that our methodology out-performs the traditional mean and median vector representations.
keywords: Ontology, Word Embedding, Representative Vector, Neural Networks, word2vec
Semantic models are used to present the hierarchy and semantic meaning of concepts. Among them, ontologies are a widely used model, extensively applied in many fields. As defined by Thomas R. Gruber, an ontology is a “formal and explicit specification of a shared conceptualization”. Ontologies are increasingly used in computational tasks because they can overcome limitations of traditional natural language processing methods in domains such as text classification [2, 3], word set expansion, linguistic information management [5, 6, 7, 8], medical information management [9, 10], and information extraction [11, 12].
However, very few attempts have been made to represent ontology classes in alternative forms such as vector representations. The importance of having such representations becomes apparent in ontology mapping, ontology merging, ontology integration, ontology alignment, and semi-automated ontology population. Nevertheless, how best to represent ontology classes in different forms remains an open question.
In this study we propose a novel way of deriving representative vectors for ontology classes. This is an important problem in the domain of automatic ontology population and automatic ontology class labeling. We use a distributed representation of words in a vector space, achieved by means of word vector embeddings, to transform the word strings in the instances and the class labels to the same vector space. For this task of word embedding, we chose the neural network based method word2vec, proposed by Tomas Mikolov et al., which is a model architecture for computing continuous vector representations of words from very large data sets.
In the proposed methodology we created an ontology in the domain of consumer protection law with the help of legal experts. The word2vec model was trained with the legal cases from the FindLaw online database. The word embedding vectors of the instances and the class labels in the created ontology were then obtained using the trained word2vec model. For each ontology class a number of candidate vectors were then calculated using the word embedding vectors of the instances. The candidate vectors and the class label vectors were then used to train a machine learning model to predict the best representative vector. We show that our proposed methodology outperforms the traditional average (mean) vector representation and median vector representation in all classes. On average, the distance of our representative vector to the class vector is , while the mean vector has a distance of and the median vector has a distance of . Respectively, it is a and improvement.
The rest of this paper is organized as follows: In Section II we review previous studies related to this work. The details of our methodology for deriving a representative vector for ontology classes with instance word vector embeddings are introduced in Section III. In Section IV, we demonstrate that our proposed methodology produces superior results, outperforming traditional approaches. Finally, we conclude and discuss future work in Section V.
II Background and Related Work
This section discusses the background of the techniques used in this study and related previous work by others in areas relevant to this research. The following subsections cover the key areas of this study.
In many areas, ontologies are used to organize information as a form of knowledge representation. An ontology may model either the world or a part of it as seen from the said area’s viewpoint.
Individuals (instances) make up the ground level of an ontology. These can be either concrete objects or abstract objects. Individuals are then grouped into structures called classes. Depending on the domain on which the ontology is based, a class in an ontology can be referred to as a concept, type, category, or kind. More often, the definition of a class and the role thereof is analogous to that of a collection of individuals with some additional properties that distinguish it from a mere set of objects. A class can either subsume, or be subsumed by, another class. This subsuming process gives rise to the class hierarchy and the concept of super-classes (and sub-classes).
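The structure described above (individuals grouped into classes, classes subsuming other classes) can be sketched as a small data structure. This is only an illustrative sketch: the names `OntologyClass`, `subsumes` and `all_instances`, the "Person" super-class and the listed individuals are assumptions made for the example and are not taken from the ontology built in this study.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyClass:
    """A class (concept) holding individuals and links to its sub-classes."""
    name: str
    instances: set = field(default_factory=set)
    subclasses: list = field(default_factory=list)

    def add_subclass(self, child):
        """Register `child` as directly subsumed by this class."""
        self.subclasses.append(child)
        return child

    def subsumes(self, other):
        """True if `other` is this class or lies anywhere below it."""
        if other is self:
            return True
        return any(child.subsumes(other) for child in self.subclasses)

    def all_instances(self):
        """Individuals of this class plus those of every sub-class."""
        collected = set(self.instances)
        for child in self.subclasses:
            collected |= child.all_instances()
        return collected


# Hypothetical two-level hierarchy: "Person" subsumes "Judge".
person = OntologyClass("Person")
judge = person.add_subclass(OntologyClass("Judge", {"magistrate", "justice"}))
```

Here `person.subsumes(judge)` holds while `judge.subsumes(person)` does not, which mirrors the super-class and sub-class relation.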
II-B Word set expansion
In many Natural Language Processing (NLP) tasks, creating and maintaining word-lists is an integral part. The said word-lists usually contain words that are deemed to be homogeneous at the level of abstraction involved in the application. Thus, two words $w_1$ and $w_2$ might belong to a single word-list in one application but belong to different word-lists in another application. This fuzzy definition and usage is what makes the creation and maintenance of these word-lists a complex task.
For the purpose of this study, we selected the algorithm presented in  which was built on the earlier algorithm described in . The reason for this selection is: WordNet  based linguistic processes are reliable due to the fact that the WordNet lexicon was built on the knowledge of expert linguists.
II-C Word Embedding
Word embedding systems are a set of natural language modeling and feature learning techniques in which words from a domain are mapped to vectors to create a model that has a distributed representation of words, first proposed by . Each of the words in a text document is mapped to a vector space. In addition, word meanings and relationships between the words are also mapped to the same vector space. word2vec (https://code.google.com/p/word2vec/), GloVe, and Latent Dirichlet Allocation (LDA) are leading word vector embedding systems. Both word2vec and GloVe use word to neighboring word mapping to learn dense embeddings. The difference is that word2vec uses a neural network based approach while GloVe uses a matrix factorization mechanism. LDA also has an approach that utilizes matrices, but there the words are mapped to the relevant sets of documents. Due to its flexibility and ease of customization, we picked word2vec as the word embedding method for this study.
Word2vec is used in sentiment analysis [20, 21, 22, 23] and text classification. In the ontology domain there are two main works that involve word2vec: Gerhard Wohlgenannt et al.’s approach to emulate a simple ontology using word2vec, and Harmen Prins’s usage of the word2vec extension node2vec to overcome the problems in vectorization of an ontology.
Clustering is a seminal part of exploratory data mining and statistical data analysis. The objective of clustering is to group a set of items into separate sub-sets (clusters) where the items in a given cluster are more similar to each other than any of them is to an item from a different cluster. The similarity measure used and the desired number of clusters are application dependent. A clustering algorithm is inherently modeled as an iterative multi-objective optimization problem that involves trial and error and tries to move towards a state that exhibits the desired properties. Out of all the clustering methods available, we selected k-means clustering due to its ease of implementation and configuration.
Arguably, k-means clustering was first proposed by Stuart Lloyd as a method of vector quantization for pulse-code modulation in the domain of signal processing. The objective is to partition $n$ observations into $k$ clusters where each observation $x$ belongs to the cluster with the mean $\mu_i$ such that, in the set of cluster means $\{\mu_1, \ldots, \mu_k\}$, $\mu_i$ is the closest to $x$ when measured by a given vector distance measure. It is implicitly assumed that the cluster mean serves as the prototype of the cluster. This results in the vector space being partitioned into Voronoi cells.
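The assignment and update steps described above can be sketched in a few lines. This is a minimal illustration of Lloyd-style iteration, assuming Euclidean distance and caller-supplied initial means; the function names and the toy points are hypothetical and this is not the implementation used in the study.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def k_means(points, initial_means, max_iter=100):
    """Alternate assignment and update steps until assignments stop changing."""
    means = [list(m) for m in initial_means]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each observation joins the cluster with the closest mean.
        new_assignment = [
            min(range(len(means)), key=lambda c: euclidean(p, means[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update step: each mean becomes the centroid of its current members.
        for c in range(len(means)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                means[c] = mean_vector(members)
    return means, assignment


# Toy data with two visibly separate groups (illustrative values only).
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
means, assignment = k_means(points, initial_means=[(0, 0), (5, 5)])
```

With k set to 2, the same routine yields the two sub-clusters and their centers for a single class.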
II-E Support Vector Machines
A Support Vector Machine (SVM) is a supervised learning model that is commonly used in machine learning tasks that analyze data for the purpose of classification or regression analysis. The SVM algorithm works on a set of training examples where each example is tagged as belonging to one of two classes. The objective is to find a hyperplane dividing these two classes such that the examples of the two classes are separated by a clear gap which is as wide as mathematically possible. Thus the process is a non-probabilistic binary linear classification task. The aforementioned gap is bounded by the instances that are named support vectors.
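For reference, the widest-gap objective can be written down for the linearly separable (hard-margin) case. The notation below is assumed for illustration: training examples $x_i$ with labels $y_i \in \{-1, +1\}$, a weight vector $w$ and a bias $b$.

```latex
\min_{w,\,b} \; \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \ge 1 \quad \text{for all } i
```

Under this formulation the width of the gap is $2 / \lVert w \rVert$, and the support vectors are the examples for which the constraint holds with equality, that is, the examples lying on the boundary of the gap.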
Uses of SVMs for tasks pertaining to ontologies are rather rare. However, a study by Jie Liu et al. defined a similarity cube which consists of similarity vectors combining similar concepts, instances and structures of two ontologies, and then processed it through an SVM-based mapping discovery function to map similarities between two ontologies. Further, another study by Jie Liu et al. proposed a method of similarity aggregation using SVM to classify weighted similarity vectors which are calculated using the concept names and properties of individuals of ontologies. Our usage of SVM in the ontology domain in this paper is different from their approaches and is, to our knowledge, novel.
In a generic SVM, newly introduced examples are predicted to fall into either class depending on their position relative to the hyperplane that divides the two classes. However, in this study we do not need to assign new instances. We are only interested in calculating the support vectors. The usage and rationalization of this is given in Section III-D.
We discuss the methodology that we used for deriving a vector representation for ontology classes using instance vector embeddings in this section. Each of the following subsections describes a step of our process. An overview of the methodology we propose in Sections III-A and III-B is illustrated in Fig. 1 and an overview of the methodology we propose from Section III-C to Section III-G is illustrated in Fig. 2.
III-A Ontology Creation
We created a legal ontology based on the consumer protection law, taking Findlaw  as the reference. After creating the ontology class hierarchy, we manually added seed instances for all the classes in the ontology. This was done based on manual inspection of the content of legal cases under consumer protection law in Findlaw. Next we used the algorithm proposed in  to expand the instance sets. The expanded lists were then pruned manually to prevent conceptual drift. This entire process was done under the supervision and guidance of experts from the legal domain.
III-B Training Word Embeddings
The training of the word embeddings was the process of building a word2vec model using a large legal text corpus obtained from Findlaw . The text corpus consisted of legal cases under 78 law categories. In creating the legal text corpus we used Stanford CoreNLP for preprocessing the text with tokenizing, sentence splitting, Part of Speech (PoS) tagging, and lemmatizing.
The motive behind using a pipeline that pre-processes text up to and including lemmatization, instead of the traditional approach of training the word2vec model with just tokenized text, was to map all inflected forms of a given lemma to a single entity. In the traditional approach each inflected form of a lemma gets trained as a separate vector. This dilutes the values that are extracted from the context of that lemma across all inflected forms, sometimes resulting in values not meeting the threshold when a test of significance is done. By reducing all inflected forms to the relevant lemma before training, we made sure that all the contributions of the context of a given lemma are collected in a single vector, thus making the system more accurate. Secondly, having a unique vector for each inflected form makes the model unnecessarily large and heavy. This results in difficulties at both training and testing. Our approach of lemmatizing the corpus first solves that problem as well. In addition to this, to reduce ambiguities caused by the case of the strings, we converted all strings to lowercase before training the word2vec model. This too is a divergence from the conventional approach.
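The effect of lemmatizing and lowercasing before training can be illustrated by counting vocabulary entries. The sketch below is only a toy: the `LEMMAS` table is a hypothetical stand-in for the output of a real lemmatizer such as the one in Stanford CoreNLP, and the tokens are invented for the example.

```python
from collections import Counter

# Hypothetical lemma table, standing in for a real lemmatizer's output.
LEMMAS = {
    "judges": "judge",
    "judged": "judge",
    "judging": "judge",
    "courts": "court",
}


def normalize(tokens):
    """Lowercase each token, then map inflected forms to their lemma."""
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]


tokens = ["Judges", "judged", "the", "Court", "judge", "courts"]

# Without normalization every surface form is its own vocabulary entry.
raw_vocab = Counter(tokens)

# With normalization the context of all forms is pooled under one entry.
pooled_vocab = Counter(normalize(tokens))
```

In this toy example six distinct surface forms collapse into three entries, so each remaining entry (and hence each trained vector) gathers the context of all its inflected and differently cased forms.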
III-C Sub-Cluster Creation
By definition, the instances in a given class of an ontology are more semantically similar to each other than to instances in other classes. But no matter how coherent a set of items is, as long as that set contains more than one element, it is possible to create non-empty sub-sets that are proper subsets of the original set. This is the main motivation behind this step of our methodology. Following this rationale, it was decided that it is possible to find at least one main schism in an otherwise semantically coherent set of more than one instance.
Given that even then the schisms would be fairly small in comparison to the difference between the instances of one class and the instances of another class, it was decided to stop the sub-set creation at 2. This means that we decided to divide the instances in a single class into two sub-clusters. For this purpose, we use K-means clustering with K=2. For example, we subdivide the “Judge” class using k-means into “Judge1” and “Judge2” and then use the support vectors between those two to predict the label of “Judge”. The centers of these sub-clusters are two of the candidate vectors used in Section III-E.
III-D Support Vector Calculation
The problem handled in this study is closer to a clustering task than to a classification task. In addition, each run of the proposed algorithm operates on the instances of a single class. This implies that, even if we were to assign class labels to the instances in order to model it as a classification task, it would be a futile effort because only one class exists. Thus, having a classification algorithm such as Support Vector Machines (SVM) as part of the process of this study might seem peculiar at first. However, this is no longer an issue because of the steps taken in Section III-C. Instead of a single unified cluster with a homogeneous class label, the previous step yielded two sub-clusters with two unique labels. Given that the whole premise of creating sub-clusters was based on the argument that there exists a schism in the individual vectors in the class, it is logical for the next step to quantify that schism.
For this task we used an SVM. The SVM was given the individuals in the two sub-clusters as the two classes and was directed to discover the support vectors. This process found the support vectors to be on either side of the schism that was discussed in Section III-C.
In identifying the support vectors, we used the algorithm used by Edgar Osuna et al. in training support vector machines and then performed certain modifications to output the support vectors.
III-E Candidate Matrix Calculation
With the above steps completed, in order to derive the vector representation for ontology classes, we calculated a number of candidate vectors for each class and then derived the candidate matrix from those candidate vectors. We describe the said candidate vectors below:
Average support vector ($C_1$)
Average instance vector ($C_2$)
Class Median vector ($C_3$)
Sub-cluster average vectors ($C_4$, $C_5$)
The Class name Vector ($C_0$) is obtained by performing word vectorization on the selected class’s class name and it was used as our desired output.
III-E1 Average support vector ($C_1$)
We identified the support vectors which mark the space around the hyperplane that divides the class into the two sub-classes as mentioned in Section III-D. We take the average of the said support vectors as the first candidate vector. The rationale behind this idea is that, as described in Section III-C, there exists a schism in between the two sub-classes. The support vectors mark the edges of that schism, which means the average of the support vectors falls at the center of the said schism. Given that the schism is a representative property of the class, it is rational to consider this average support vector that falls in the middle of it as representative of the class. We averaged the instance vectors as shown in equation 1 to calculate the average support vector.
$$\overline{V} = \frac{\sum_{i=1}^{N} a_i V_i}{\sum_{i=1}^{N} a_i} \qquad (1)$$
Here, $N$ is the total number of vectors in the class. $V_i$ represents the $i$th instance vector. $a$ is the support vector membership vector such that: if the $i$th vector is a support vector, $a_i$ is $1$ and otherwise, it is $0$. Here $\overline{V}$ is $C_1$, which represents the average support vector candidate vector.
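The membership-weighted average described above can be sketched as follows. The function name `masked_average` and the toy vectors are assumptions made for illustration; the membership flags would in practice come from the support vector step of Section III-D.

```python
def masked_average(vectors, membership):
    """Average only the vectors whose membership flag is 1."""
    total = sum(membership)
    if total == 0:
        raise ValueError("at least one vector must be selected")
    dims = len(vectors[0])
    return [
        sum(a * v[j] for a, v in zip(membership, vectors)) / total
        for j in range(dims)
    ]


# Toy instance vectors; the flags mark the first and third as support vectors.
instances = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
support_flags = [1, 0, 1]

average_support = masked_average(instances, support_flags)
# Setting every flag to 1 turns the same routine into a plain mean.
average_instance = masked_average(instances, [1, 1, 1])
```

Because only the flags change, the same routine can serve for the average support vector, the average instance vector and each sub-cluster average.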
III-E2 Average instance vector ($C_2$)
This is by far the most commonly used class representation vector in other studies. We took all the instance vectors of the relevant class and averaged them to come up with this candidate vector for the class. The average instance vector was also calculated using equation 1. However, this time every membership value $a_i$ was initialized to $1$. In that case, the resulting average $\overline{V}$ is $C_2$, which represents the average instance vector.
III-E3 Class Median vector ($C_3$)
Out of the instance vectors of a class, we calculated the median vector and added it as a candidate vector for that class. The Class Median vector ($C_3$) was calculated as shown in equation 2.
$$C_3 = \underset{V_i \in V}{\arg\min} \sqrt{\sum_{j=1}^{d} \left(v_{i,j} - \overline{v}_j\right)^2} \qquad (2)$$
Here, $V$ is the set of instance vectors in the class. $V_i$ is the $i$th instance vector and $v_{i,j}$ is its $j$th element. $d$ is the number of dimensions in an instance vector. $\overline{v}_j$ is the $j$th element in the average instance vector that was calculated above using equation 1.
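One reading of the class median, given the quantities it is defined in terms of, is the instance vector that lies closest to the average instance vector. The sketch below follows that reading and assumes Euclidean distance; both the reading and the names `class_median` and `mean_vector` are assumptions made for illustration.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def class_median(vectors):
    """Return the instance vector nearest to the mean (assumed Euclidean)."""
    centre = mean_vector(vectors)

    def distance_to_centre(v):
        return math.sqrt(sum((vj - cj) ** 2 for vj, cj in zip(v, centre)))

    return min(vectors, key=distance_to_centre)


# Toy instance vectors; their mean is [4.0, 1.0].
instances = [[0.0, 0.0], [2.0, 2.0], [10.0, 1.0]]
median_vector = class_median(instances)
```

Unlike the mean, the vector returned this way is always an actual instance of the class.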
III-E4 Sub-cluster average vectors ($C_4$, $C_5$)
We took all the instance vectors of one sub-cluster and averaged them to calculate $C_4$, and then did the same for the other sub-cluster to calculate $C_5$. The rationale behind this step is the fact that, as described in Section III-C, the two sub-clusters that we find are representative of the main division that exists in the class. Thus it is justifiable to consider the centroid of each of those sub-clusters.
Each sub-cluster average instance vector was also calculated using equation 1. However, this time every membership value $a_i$ in the first cluster was initialized to $1$ and every $a_i$ in the second cluster was initialized to $0$. In that case, the resulting average $\overline{V}$ is $C_4$, which represents the average instance vector for the first cluster. Next, the same function was used with every $a_i$ in the first cluster initialized to $0$ and every $a_i$ in the second cluster initialized to $1$ to calculate $\overline{V}$, which was assigned to $C_5$.
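III-E5 Candidate Matrix
Finally, the candidate matrix $M_{cand}$ for each class is made by concatenating the transposes of $C_1$ through $C_5$ as shown in equation 3.
$$M_{cand} = \left[\, C_1^{T} \;\; C_2^{T} \;\; C_3^{T} \;\; C_4^{T} \;\; C_5^{T} \,\right] \qquad (3)$$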
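The concatenation can be sketched as a transpose of the list of candidate vectors, so that each row of the result holds one dimension and each column holds one candidate. The function name `candidate_matrix` and the three-dimensional toy vectors are assumptions made for illustration.

```python
def candidate_matrix(candidates):
    """Stack candidate vectors as columns: one row per embedding dimension."""
    lengths = {len(c) for c in candidates}
    if len(lengths) != 1:
        raise ValueError("all candidate vectors must have the same length")
    return [list(row) for row in zip(*candidates)]


# Five toy candidate vectors of length 3 (illustrative values only).
c1 = [0.1, 0.2, 0.3]
c2 = [0.2, 0.2, 0.2]
c3 = [0.0, 0.1, 0.4]
c4 = [0.3, 0.1, 0.2]
c5 = [0.1, 0.3, 0.3]

matrix = candidate_matrix([c1, c2, c3, c4, c5])
```

For an embedding of length 3 this gives a 3 by 5 matrix, and row i collects the i-th element of every candidate.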
III-F Optimization Goal
After calculating the candidate vectors, we proposed an optimal vector that represents the given class based on the optimization goal as follows:
Here, $Y$ is the predicted class vector for the given class. $M$ is the number of candidate vectors for a given class. $C_i$ and $w_i$ represent the $i$th candidate vector and the associated weight of that candidate vector respectively. Here $w_i$ is calculated using the method described in Section III-G.
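One way to write a weighted combination consistent with these definitions is given below. It is a hedged sketch rather than the exact goal: it assumes an unnormalised weighted sum of the candidate vectors, a squared Euclidean error against the class name vector, and weights shared by all classes. The superscript $(c)$ indexing classes and the symbols $Y$, $C_i$, $w_i$ and $C_0$ are notation assumed for this sketch.

```latex
Y^{(c)} = \sum_{i=1}^{M} w_i \, C_i^{(c)},
\qquad
w^{*} = \arg\min_{w} \sum_{c} \left\lVert Y^{(c)} - C_0^{(c)} \right\rVert_2^{2}
```

Whether the combination is additionally normalised, for example divided by $M$, does not change which vectors can be expressed, since any constant factor can be absorbed into the weights.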
III-G Machine Learning implementation for weight calculation
The main motive behind adding a weight for each candidate vector is to account for the significance of the candidate vector towards the optimized class vector. We decided to use machine learning to calculate the weight vector. The machine learning method we used is a neural network.
The dataset is structured in the traditional $(X, T)$ structure where $X$ is the set of inputs and $T$ is the set of outputs. An input tuple $x$ (such that $x \in X$) has five elements, one from each candidate vector. The matching output tuple $t$ (such that $t \in T$) has a single element.
$X$ and $T$ are populated as follows: take the $i$th row of the candidate matrix of the class as $x$ and add it to $X$. Take the $i$th element of $C_0$ of the class as $t$ and add it to $T$. Once this is done for all the classes, we get the complete $(X, T)$ set. It should be noted that since the weights are learned universally and not on a class by class basis, there is one large dataset rather than a number of small datasets, one per class. The reason for this is to make sure that the model does not overfit to one class and instead generalizes over all the data across the entire ontology. Because of this approach, we end up with a considerable amount of training data, which again justifies our decision to use machine learning. For a word embedding vector length of $d$ over $n$ classes, this approach creates $d \times n$ training examples.
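The way the training set is populated can be sketched as below: every embedding dimension of every class contributes one example, whose input is that dimension's value in each candidate vector and whose output is that dimension's value in the class name vector. The function name `build_dataset` and all numeric values are hypothetical and serve only to show the shape of the data.

```python
def build_dataset(classes):
    """Flatten per-class candidates into one (inputs, targets) training set.

    `classes` is a list of (candidate_vectors, class_name_vector) pairs.
    """
    inputs, targets = [], []
    for candidates, class_name_vector in classes:
        for i, target in enumerate(class_name_vector):
            # i-th row of the candidate matrix: i-th element of each candidate.
            inputs.append([candidate[i] for candidate in candidates])
            targets.append(target)
    return inputs, targets


# Two toy classes with 3-dimensional embeddings and five candidates each.
judge = (
    [[0.1, 0.2, 0.3], [0.2, 0.2, 0.2], [0.0, 0.1, 0.4],
     [0.3, 0.1, 0.2], [0.1, 0.3, 0.3]],
    [0.15, 0.2, 0.25],
)
complaint = ([[0.5, 0.6, 0.7]] * 5, [0.5, 0.6, 0.7])

inputs, targets = build_dataset([judge, complaint])
```

With two classes and an embedding length of 3 this yields 6 examples, each with five input values and one target value, and a single model is then trained on the pooled set.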
We used a set of legal ontology classes seeded by the legal experts and then expanded by the set expansion algorithm under the guidance of the same legal experts. We report our findings in Table I, and Fig. 3 shows a visual comparison of the same data. We illustrate the results that we obtained for ten prominent legal concept classes as well as the mean result across all the classes considered. We compare the representative vector proposed by us against the traditional representative vectors: the average vector and the median vector. All the results shown in the table are Euclidean distances between the representative vector in question and the respective class name vector.
|Average Vector||Median Vector||Our Model|
It can be observed from the results that the traditional approach of taking the average vector as the representative vector of a class outperforms the other traditional approach of using the median vector. However, our proposed method outperforms both the average and median vectors in all cases. For example, in the “Judge” class our model vector performs 47.8% better than the average vector, while in the “Complaint” class it is 53.8% better.
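The comparison above can be sketched as follows. The sketch assumes that "better" means the relative reduction in Euclidean distance to the class name vector; that formula, the function names and the toy vectors are assumptions made for illustration and do not reproduce the reported figures.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def improvement_percent(baseline_distance, model_distance):
    """Relative reduction in distance, as a percentage of the baseline."""
    return 100.0 * (baseline_distance - model_distance) / baseline_distance


# Hypothetical 2-dimensional vectors, chosen only to keep the arithmetic simple.
class_name_vector = [0.0, 0.0]
average_vector = [3.0, 4.0]
model_vector = [0.0, 2.5]

baseline = euclidean(average_vector, class_name_vector)
ours = euclidean(model_vector, class_name_vector)
gain = improvement_percent(baseline, ours)
```

In this toy case the baseline distance is 5.0 and the model distance is 2.5, which corresponds to a 50% improvement under the assumed definition.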
V Conclusion and Future Works
In this work, we have demonstrated that the proposed method provides a better representation for a set of instances that occupy an ontology class than the traditional methods of using the average vector or the median vector. This finding will be helpful mainly in two important tasks in the ontology domain.
The first is further populating an already seeded ontology. In this study we used the algorithm proposed in  for this task to obtain a test dataset. However, that approach has the weakness of being dependent on the WordNet lexicon. A methodology built on the representative vector discovery algorithm proposed in this work will not have that weakness, because all the necessary vector representations are obtained from word vector embeddings trained on a corpus relevant to the given domain. Thus all the unique jargon would be adequately covered without much threat of conceptual drift. As future work, we expect to take this idea forward for automatic ontology population.
The second important task in the ontology domain for which this method will be useful is class labeling. In this study we have demonstrated that our method is capable of deriving the closest vector representation to the class label. Thus, the converse should hold as well. That would be a topic modeling task. The idea is that, given an unlabeled class, the method proposed in this study can be used to derive the representative vector. Then, by querying the word embedding vector space, it is possible to obtain the most suitable class label (topic) candidate.
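The labeling idea can be sketched as a nearest-neighbour query over the embedding space. The sketch assumes Euclidean distance and uses a tiny hand-made vocabulary; the function name `nearest_words` and every vector value are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_words(query, embeddings, top_n=1):
    """Rank vocabulary words by distance to the query vector."""
    ranked = sorted(embeddings, key=lambda word: euclidean(query, embeddings[word]))
    return ranked[:top_n]


# Hypothetical 2-dimensional embeddings for three vocabulary words.
embeddings = {
    "judge": [0.9, 0.1],
    "complaint": [0.1, 0.8],
    "court": [0.7, 0.3],
}

# Representative vector derived for an unlabeled class (illustrative values).
representative = [0.85, 0.15]
label_candidates = nearest_words(representative, embeddings, top_n=2)
```

The first entry of `label_candidates` would be proposed as the class label, with the remaining entries kept as alternatives for manual review.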
-  T. R. Gruber, “A translation approach to portable ontology specifications,” Knowledge Acquisition, 5(2):199-220, 1993.
-  X.-Q. Yang, N. Sun, T.-L. Sun, X.-Y. Cao, and X.-J. Zheng, “The application of latent semantic indexing and ontology in text classification,” International Journal of Innovative Computing, Information and Control, vol. 5, no. 12, pp. 4491–4499, 2009.
-  N. de Silva, “Safs3 algorithm: Frequency statistic and semantic similarity based semantic classification use case,” Advances in ICT for Emerging Regions (ICTer), 2015 Fifteenth International Conference on, pp. 77–83, 2015.
-  N. De Silva, A. Perera, and M. Maldeniya, “Semi-supervised algorithm for concept ontology based word set expansion,” Advances in ICT for Emerging Regions (ICTer), 2013 International Conference on, pp. 125–131, 2013.
-  G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller, “Introduction to wordnet: An on-line lexical database,” International journal of lexicography, vol. 3, no. 4, pp. 235–244, 1990.
-  G. A. Miller, “Nouns in wordnet: a lexical inheritance system,” International journal of Lexicography, vol. 3, no. 4, pp. 245–264, 1990.
-  C. Fellbaum, WordNet. Wiley Online Library, 1998.
-  I. Wijesiri, M. Gallage, B. Gunathilaka, M. Lakjeewa, D. C. Wimalasuriya, G. Dias, R. Paranavithana, and N. De Silva, “Building a wordnet for sinhala,” Volume editors, p. 100, 2014.
-  J. Huang, F. Gutierrez, H. J. Strachan, D. Dou, W. Huang, B. Smith, J. A. Blake, K. Eilbeck, D. A. Natale, Y. Lin et al., “Omnisearch: a semantic search system based on the ontology for microrna target (omit) for microrna-target gene interaction data,” Journal of biomedical semantics, vol. 7, no. 1, p. 1, 2016.
-  J. Huang, K. Eilbeck, B. Smith, J. A. Blake, D. Dou, W. Huang, D. A. Natale, A. Ruttenberg, J. Huan, M. T. Zimmermann et al., “The development of non-coding rna ontology,” International journal of data mining and bioinformatics, vol. 15, no. 3, pp. 214–232, 2016.
-  D. C. Wimalasuriya and D. Dou, “Ontology-based information extraction: An introduction and a survey of current approaches,” Journal of Information Science, 2010.
-  N. de Silva, D. Dou, and J. Huang, “Discovering inconsistencies in pubmed abstracts through ontology-based information extraction,” ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB), p. to appear, 2017.
-  N. Choi, I.-Y. Song, and H. Han, “A survey on ontology mapping,” ACM Sigmod Record, vol. 35, no. 3, pp. 34–41, 2006.
-  T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” Advances in neural information processing systems, pp. 3111–3119, 2013.
-  T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
-  “FindLaw cases and codes,” http://caselaw.findlaw.com/, accessed: 2017-05-18.
-  N. de Silva, C. Fernando, M. Maldeniya, D. Wijeratne, A. Perera, and B. Goertzel, “Semap-mapping dependency relationships into semantic frame relationships,” in 17th ERU Research Symposium, vol. 17. Faculty of Engineering, University of Moratuwa, Sri Lanka, 2011.
-  J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation.” in EMNLP, vol. 14, 2014, pp. 1532–1543.
-  R. Das, M. Zaheer, and C. Dyer, “Gaussian lda for topic models with word embeddings.” in ACL (1), 2015, pp. 795–804.
-  D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, and B. Qin, “Learning sentiment-specific word embedding for twitter sentiment classification,” Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 1555–1565, 2014.
-  B. Xue, C. Fu, and Z. Shaobin, “Study on sentiment computing and classification of sina weibo with word2vec,” Big Data (BigData Congress), 2014 IEEE International Congress on. IEEE, pp. 358–363, 2014.
-  D. Zhang, H. Xu, Z. Su, and Y. Xu, “Chinese comments sentiment classification based on word2vec and svm perf,” Expert Systems with Applications, vol. 42, no. 4, pp. 1857–1863, 2015.
-  H. Liu, “Sentiment analysis of citations using word2vec,” arXiv preprint arXiv:1704.00177, 2017.
-  J. Lilleberg, Y. Zhu, and Y. Zhang, “Support vector machines and word2vec for text classification with semantic features,” in Cognitive Informatics & Cognitive Computing (ICCI* CC), 2015 IEEE 14th International Conference on. IEEE, 2015, pp. 136–140.
-  G. Wohlgenannt and F. Minic. Using word2vec to build a simple ontology learning system. Available at: http://ceur-ws.org/Vol-1690/paper37.pdf. Accessed: 2017-05-30.
-  H. Prins, “Matching ontologies with distributed word embeddings.”
-  A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” pp. 855–864, 2016.
-  S. Lloyd, “Least squares quantization in pcm,” IEEE transactions on information theory, vol. 28, no. 2, pp. 129–137, 1982.
-  C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, no. 3, pp. 273–297, 1995.
-  L. Liu, F. Yang, P. Zhang, J.-Y. Wu, and L. Hu, “Svm-based ontology matching approach,” International Journal of Automation and Computing, vol. 9, no. 3, pp. 306–314, 2012.
-  J. Liu, L. Qin, and H. Wang, “An ontology mapping method based on support vector machine,” pp. 225–226, 2013.
-  E. Osuna, R. Freund, and F. Girosi, “An improved training algorithm for support vector machines,” pp. 276–285, 1997.