Fast and scalable learning of neuro-symbolic representations of biomedical knowledge
Abstract
In this work we address the problem of fast and scalable learning of neuro-symbolic representations for general biological knowledge. Based on a recently published comprehensive biological knowledge graph (Alshahrani, 2017) that was used for demonstrating neuro-symbolic representation learning, we show how to train fast (under 1 minute) log-linear neural embeddings of the entities. We utilize these representations as inputs for machine learning classifiers to enable important tasks such as biological link prediction. Classifiers are trained by concatenating learned entity embeddings to represent entity relations, and training classifiers on the concatenated embeddings to discern true relations from automatically generated negative examples. Our simple embedding methodology greatly improves on classification error compared to previously published state-of-the-art results, yielding a maximum increase of 0.28 in F-measure and 0.22 in ROC AUC for the most difficult biological link prediction problem. Finally, our embedding approach is orders of magnitude faster to train (under 1 minute vs. hours), much more economical in terms of embedding dimensions (at most 50 vs. 512), and naturally encodes the directionality of the asymmetric biological relations, which can be controlled by the order in which we concatenate the embeddings.
Keywords:
knowledge graphs, neural embeddings, biological link prediction

1 Introduction
Over the last decade there has been a very popular trend of merging neural and symbolic representations of knowledge for large, general-purpose knowledge graphs such as FreeBase [1] and WordNet [2]. The methods employed can be roughly divided into two groups: i) multi-relational knowledge graph embeddings [3, 4] and ii) graph embeddings [5, 6]. The former aims at learning representations of both entities and relations, while the latter focuses on untyped graphs, where each relation's type can be dropped without introducing ambiguities. Both approaches aim at solving the problem of link prediction, i.e., modeling the probability of an instance of a relation (e.g., r(s, t)) based on d-dimensional vector representations of the entities (e.g., v_s, v_t) and binary operations defined on them. Thus, in the case of multi-relational knowledge graphs we seek to embed both entities and relations into a d-dimensional vector space, and we model the probability of a triple (a labeled arc of the graph) with a similarity function such as the Euclidean dot product. In the case of unlabeled graphs we drop the labels of the arcs (or edges, in case the relations can be treated as symmetric); we therefore do not embed the relations, and model a single arc (or edge) directly as the Euclidean dot product of the two node embeddings, v_s . v_t. The Euclidean dot product is only one of many ways to model the probability of a link (with a label, in the multi-relational case) between two entities; in fact, the underlying geometry may not necessarily be Euclidean. For a more in-depth survey of link prediction methodologies please see [4]. In the context of Semantic Web technologies and the Resource Description Framework (RDF) and Web Ontology Language (OWL) technology stack, specialized knowledge graph embedding methodologies have also recently been proposed [7, 8].
In the bioinformatics domain, Alshahrani et al. [9] recently proposed a novel methodology for representing nodes and relations from structured biological knowledge that operates directly on Linked Data resources, leverages ontologies, and yields neuro-symbolic representations amenable for downstream use in machine learning algorithms. The authors base their methodology on the DeepWalk algorithm, which performs random walks on unlabeled and undirected graphs (i.e., with symmetric relations) [5] and embeds entities through an approach inspired by the popular Word2Vec algorithm [10]. This methodology is further tuned for multi-relational data by explicitly encoding sequences of intermingled entities and relations. Such complex intermingled sequences alleviate the innate undirected nature of the random walks, at the expense of an increased number of parameters to train. Unfortunately, training such models is computationally expensive (hours on a modern Intel Core i7 desktop machine) and requires relatively large embedding dimensions (512). This manuscript builds upon this seminal work and proposes a more economical, fast and scalable way of learning neuro-symbolic representations. The neural embeddings obtained with our approach outperform published state-of-the-art results, with specific assumptions on the structure of the original knowledge graph and with a smart encoding of links based on the embeddings of the entities. Among other things, the contributions of this work are based on the following hypothesis:

Using the concatenation of the neural embeddings naturally encodes the directionality of the asymmetric biological relations, and fully exploits the nonlinear patterns that can be uncovered by neural network classifiers.
2 Materials and methods
2.1 Dataset and evaluation methodology for link prediction
In this work we consider the curated biological knowledge graph presented in [9]. This knowledge graph is based on three ontologies: the Gene Ontology [12], the Human Phenotype Ontology [13] and the Disease Ontology [14]. It also incorporates knowledge from several biological databases, including human protein-protein interactions, human chemical-protein interactions, drug side effects, and drug indication pairs. We refer the reader to [9] for a detailed description of the provenance of the data and of the data processing pipelines employed to obtain the final graph. For the purpose of this work, we summarize the number of biological relation instances present in this knowledge graph in Table 1.
Table 1: Number of biological relation instances in the knowledge graph.

relation  number of instances

has-target  554366
has-disease-annotation  236259
has-side-effect  54806
has-interaction  188424
has-function  212078
has-gene-phenotype  153575
has-indication  6704
has-disease-phenotype  84508
Our goal is to train fast neural embeddings of the nodes of this knowledge graph, such that we can use these embeddings to perform link prediction. That is, we try to estimate the probability that an edge with label r (e.g., has-target) exists between the nodes s and t (e.g., a drug and a protein) given their vector representations v_s and v_t. As in [9] we build separate binary prediction models for each relation in the knowledge graph. Note that in this work we only focus on the link prediction problem where the embeddings are trained on the knowledge graph from which we remove 20% of the edges for a given relation (this corresponds to the first link prediction problem reported in [9]). We then use these embeddings to train classifiers (logistic regression and a multilayer perceptron (MLP)) on 80% of the positive true edges (i.e., relation instances) and on the same number of generated negative edges. These classifiers are then tested on the remaining 20% of positive and generated negative edges (which have not been used in the embedding generation). For a fair comparison with the state-of-the-art results, we use the same methodology for negative sample generation, and we use 5-fold cross-validation for the training of embeddings and the subsequent link prediction classifiers, precisely the same way as in [9]. For all of our experiments we do not use any deductive inference, and we compare our results with the results obtained without inference in [9].
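The splitting and negative sampling protocol described above can be sketched as follows. This is a minimal, hypothetical helper (not the authors' code): it holds out 20% of the positive edges and pairs the train and test sets with equal numbers of randomly generated negative edges that do not occur in the knowledge graph.

```python
import random

def make_split(positive_edges, nodes, test_frac=0.2, seed=42):
    """Hold out `test_frac` of the positive edges and generate one
    negative (non-existing) edge per positive edge. A sketch of the
    evaluation protocol, with hypothetical names."""
    rng = random.Random(seed)
    pos = list(positive_edges)
    rng.shuffle(pos)
    n_test = int(len(pos) * test_frac)
    test_pos, train_pos = pos[:n_test], pos[n_test:]

    # Negative sampling: random node pairs absent from the knowledge graph.
    pos_set = set(positive_edges)
    negatives = set()
    while len(negatives) < len(pos):
        s, t = rng.choice(nodes), rng.choice(nodes)
        if s != t and (s, t) not in pos_set:
            negatives.add((s, t))
    neg = list(negatives)
    test_neg, train_neg = neg[:n_test], neg[n_test:]
    return train_pos, test_pos, train_neg, test_neg
```

The embeddings themselves would then be trained only on `train_pos`, so that the held-out 20% never leaks into the representation learning step.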
2.2 Assumptions on the structure of the Knowledge Graph
Our methodology exploits the fact that the full biomedical knowledge graph we are using only contains relations that can be inferred from the types of the entities that are subject and object of the relation. This means that arc labels can be safely dropped without loss of semantics and without the introduction of ambiguous duplicated pairs of nodes. Therefore, we can flatten our graph without the risk of having more than one relation connecting the same source and target nodes, i.e., we can simply consider our knowledge graph as a set of pairs of nodes (s, t). As opposed to DeepWalk employed by [9], our methodology does not rely on random walks on knowledge graphs [5]; instead of producing sequences of labeled entities (nodes and arc labels mixed together), we directly consider pairs of connected nodes. Furthermore, we simplify the structure of the knowledge graph by removing the anonymous instances that were introduced by the creators of the knowledge graph to assert relation instances in the ABox, i.e., we directly connect OWL classes to declutter the graph used to train embeddings. In the original knowledge graph, Alshahrani et al. [9] commit to strict OWL semantics when modeling biological relations by asserting anonymous instances; for example, a relation instance of has-function (domain: Gene/Protein, range: Function) would be encoded as in Listing 1, where we present a specific instance of a relation that asserts that the TRIM28 gene has the function of negative regulation of transcription by RNA polymerase II.
We simplify the knowledge graph by removing all anonymous instances such as <http://aberowl.net/go/instance_106358> and connecting entities directly through object relations, i.e., we rewrite all triples of the form presented above (Listing 1) to a form that only contains object property assertions, as demonstrated below (Listing 2).
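The rewriting can be sketched as a simple pass over the triples. This is a hypothetical illustration, not the authors' code: the node and class identifiers in the test below (e.g., the short name `instance_106358` and the GO class) are stand-ins for the full URIs used in the actual knowledge graph.

```python
def flatten(triples, is_anonymous):
    """Collapse  s --p--> anon  plus  anon --rdf:type--> C  into the
    direct assertion  s --p--> C.  `is_anonymous` is a hypothetical
    predicate identifying anonymous instance nodes."""
    RDF_TYPE = "rdf:type"
    # Map each anonymous instance to the OWL class it instantiates.
    anon_class = {s: o for s, p, o in triples
                  if p == RDF_TYPE and is_anonymous(s)}
    out = []
    for s, p, o in triples:
        if is_anonymous(s):
            continue                  # drop the rdf:type triple itself
        if o in anon_class:
            o = anon_class[o]         # reconnect directly to the OWL class
        out.append((s, p, o))
    return out
```

Applied to the TRIM28 example above, the anonymous instance disappears and the gene is linked directly to its Gene Ontology function class.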
We admit such a relaxation in the OWL semantics commitment of the knowledge graph, because we do not leverage any OWL reasoning for our tasks. This relaxation does not change the statistics of the number of biological relation instances present in the knowledge graph (Table 1).
2.3 Training fast log-linear embeddings with StarSpace
As opposed to the approach taken by Alshahrani et al. [9], we employ a neural embedding method that requires fewer parameters and is much faster to train. Specifically, we exploit the fact that the biological relations have well-defined, non-overlapping domains and ranges, and therefore the whole knowledge graph can be treated as an untyped directed graph, where there is no ambiguity in the semantics of any relation. To this end, we employ the neural embedding model from the StarSpace toolkit [11], which aims at learning entities, each of which is described by a set of discrete features (bag-of-features) coming from a fixed-length dictionary. The model is trained by assigning a d-dimensional vector to each of the discrete features in the set that we want to embed directly. Ultimately, the lookup matrix (the matrix of embedding latent vectors) is learned by minimizing the following loss function:

sum over (a, b) in E+ and b- in E- of L_batch(sim(a, b), sim(a, b-_1), ..., sim(a, b-_k))
In this loss function, we need to indicate the generator of positive entry pairs (a, b) from E+ (in our setting, entities connected via a relation) and the generator of negative entities b- from E-, similar to the negative sampling strategy proposed by Mikolov et al. [10]. In our setting, the negative pairs are the so-called negative examples, i.e., pairs of entities that do not appear in the knowledge graph. The similarity function sim is task-dependent and should operate on the d-dimensional vector representations of the entities; in our case we use the standard Euclidean dot product. Please note that the aforementioned embedding scheme is different from a multi-relational knowledge graph embedding task: the main difference is that we do not require embeddings for the relations.
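The training loop can be sketched in a few lines of numpy. This is a didactic sketch, not the StarSpace implementation: it uses a hinge (margin ranking) loss over dot-product similarities with per-edge negative sampling, whereas StarSpace supports several batch loss functions and is heavily optimized; all hyperparameter values below are illustrative.

```python
import numpy as np

def train_embeddings(edges, n_nodes, dim=10, epochs=50, lr=0.05,
                     margin=1.0, n_neg=5, seed=0):
    """Toy log-linear embedding trainer: dot-product similarity,
    negative sampling, SGD on a hinge loss. A sketch in the spirit
    of the loss above, not the official StarSpace code."""
    rng = np.random.default_rng(seed)
    emb = rng.normal(scale=0.1, size=(n_nodes, dim))
    pos = set(edges)
    for _ in range(epochs):
        for s, t in edges:
            for _ in range(n_neg):
                tn = int(rng.integers(n_nodes))
                if (s, tn) in pos or tn == s:
                    continue  # only sample true negatives
                es, et, en = emb[s].copy(), emb[t].copy(), emb[tn].copy()
                # hinge loss: max(0, margin - sim(s,t) + sim(s,tn))
                if margin - es @ et + es @ en > 0:
                    emb[s]  += lr * (et - en)   # pull (s, t) together
                    emb[t]  += lr * es
                    emb[tn] -= lr * es          # push (s, tn) apart
    return emb
```

After training on a toy graph, connected pairs end up with a noticeably higher dot-product similarity than unconnected ones.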
Based on the embeddings of the nodes of the graph, we can come up with different ways of representing a link between nodes s and t as a binary operation defined on their embeddings (see [6] for more detail). In particular, we employ the so-called concatenation of the embeddings, representing each relation instance as the concatenated vector (v_s, v_t) (Figure 1).
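Concretely, with concatenation the source embedding is placed first and the target embedding second, so the two orderings of a pair yield distinct feature vectors. The tiny sketch below uses made-up 3-dimensional embeddings for illustration.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings of a drug and a protein.
v_drug    = np.array([0.1, 0.5, -0.2])
v_protein = np.array([0.4, -0.3, 0.7])

# For a directed relation such as has-target(drug, protein), concatenate
# source first, target second.
x_forward = np.concatenate([v_drug, v_protein])
x_reverse = np.concatenate([v_protein, v_drug])
```

Because `x_forward` and `x_reverse` differ, a downstream classifier can distinguish the direction of an asymmetric relation, which symmetric operators such as the element-wise product cannot do.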
3 Results
In Table 2 we report the state-of-the-art evaluation scores as provided in Alshahrani et al. [9]. Throughout the rest of this manuscript we refer to these results as SOTA results for convenience. We further use these state-of-the-art results to contrast our classification results in Tables 3 and 4. To simplify the interpretation of our results, Tables 3 and 4 report only the differences in F-measure and ROC AUC scores of our approach with respect to the SOTA results. Classification results are divided into two parts, differentiated by the classifier used: i) logistic regression, as in [9] (Table 3), and ii) MLP (Table 4). The two classifiers are trained on concatenated embeddings of entities (nodes), which are obtained from the flattened graphs for each biomedical relation via StarSpace [11], as described in Section 2. All classification results presented here are averaged over 5 folds to be directly and fairly compared with the results in [9].
Table 2: SOTA F-measure and ROC AUC scores reported in [9].

relation  F-measure  ROC AUC

has-disease-annotation  0.89  0.95
has-disease-phenotype  0.72  0.78
has-function  0.85  0.95
has-gene-phenotype  0.84  0.91
has-indication  0.72  0.79
has-interaction  0.82  0.88
has-side-effect  0.86  0.93
has-target  0.94  0.97
3.1 Biomedical link prediction with logistic regression
Overall, we are able to outperform the SOTA results on all relations except has-target (Table 3). It is important to notice that we improve significantly on has-indication and has-disease-phenotype, the two worst-performing relations in Alshahrani et al. [9]. We specifically consider embeddings of rather small sizes (d = 5, 10, 20, 50) to emphasize the rapidity and scalability of training embeddings with log-linear neural embedding approaches [11]. For all embedding dimensions we train our embeddings for at most 10 epochs, which keeps the overall training time of embeddings for one specific biomedical relation under 1 minute on a Core i7 desktop with 32 GB of RAM. It is also important to notice that the SOTA results were obtained via the extended DeepWalk algorithm [9] with 512 dimensions for the embeddings, which takes several hours to train on our machine. Moreover, our learned embeddings are more consistent, as their F-measure and ROC AUC scores lie in the 0.92–0.99 range for all relations, whereas the SOTA scores range from 0.72 to 0.94.
Table 3: Differences in F-measure and ROC AUC with respect to SOTA for logistic regression on concatenated embeddings, per embedding dimension.

relation  F-measure (d = 5, 10, 20, 50)  ROC AUC (d = 5, 10, 20, 50)

has-disease-annotation  -0.027  +0.013  +0.033  +0.071  -0.088  -0.047  -0.028  +0.012
has-disease-phenotype  +0.239  +0.260  +0.274  +0.279  +0.180  +0.200  +0.214  +0.219
has-function  +0.013  +0.028  +0.067  +0.117  -0.077  -0.066  -0.030  +0.017
has-gene-phenotype  +0.148  +0.156  +0.159  +0.159  +0.078  +0.086  +0.089  +0.089
has-indication  +0.186  +0.262  +0.270  +0.275  +0.112  +0.192  +0.200  +0.205
has-interaction  +0.010  +0.147  +0.179  +0.180  -0.034  +0.088  +0.119  +0.120
has-side-effect  +0.091  +0.105  +0.128  +0.137  +0.021  +0.036  +0.059  +0.067
has-target  -0.107  -0.077  -0.047  -0.018  -0.109  -0.083  -0.057  -0.034
3.2 MLP and biomedical link prediction
We hypothesize that our approach of augmenting the embedding dimension via concatenation of entity embeddings is well suited for neural network architectures. Indeed, we are able to obtain very good biological link prediction classifiers by using concatenated embeddings and multilayer perceptrons. We experimented with different shallow and deep architectures (hidden layer sizes [200], [20, 20, 20], and [200, 200, 200]), which yielded similar performance. The results of a shallow neural network with one hidden layer of 200 neurons are summarized in Table 4; they empirically show that the concatenation of the neural embeddings to represent a link between two entities fully exploits the nonlinear patterns that can be uncovered by neural network classifiers. As a result, we are able to improve on the SOTA results for all the biological link prediction tasks.
Table 4: Differences in F-measure and ROC AUC with respect to SOTA for an MLP (one hidden layer of 200 neurons) on concatenated embeddings, per embedding dimension.

relation  F-measure (d = 5, 10, 20, 50)  ROC AUC (d = 5, 10, 20, 50)

has-disease-annotation  +0.095  +0.109  +0.110  +0.110  +0.035  +0.049  +0.050  +0.050
has-disease-phenotype  +0.272  +0.279  +0.280  +0.280  +0.212  +0.219  +0.220  +0.220
has-function  +0.148  +0.150  +0.149  +0.150  +0.048  +0.050  +0.049  +0.050
has-gene-phenotype  +0.160  +0.160  +0.160  +0.160  +0.089  +0.090  +0.090  +0.090
has-indication  +0.276  +0.278  +0.279  +0.279  +0.206  +0.208  +0.209  +0.209
has-interaction  +0.180  +0.180  +0.180  +0.180  +0.120  +0.120  +0.120  +0.120
has-side-effect  +0.128  +0.137  +0.139  +0.139  +0.058  +0.067  +0.069  +0.069
has-target  -0.024  +0.006  +0.023  +0.033  -0.040  -0.016  -0.003  +0.006
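The shallow-MLP setup described in this section can be sketched with scikit-learn. This is a hypothetical illustration, not the authors' code: the toy data below stands in for concatenated entity embeddings (positives and negatives drawn from two well-separated Gaussians), and only the one-hidden-layer/200-unit architecture is taken from the text.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Toy stand-in for concatenated entity embeddings (two 10-d embeddings
# concatenated into 20-d link features), one class per Gaussian blob.
X_pos = rng.normal(loc=+1.0, scale=0.3, size=(100, 20))
X_neg = rng.normal(loc=-1.0, scale=0.3, size=(100, 20))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 100 + [0] * 100)

# Shallow architecture from the text: one hidden layer of 200 neurons.
clf = MLPClassifier(hidden_layer_sizes=(200,), max_iter=500, random_state=0)
clf.fit(X, y)
```

In the actual experiments, `X` would hold the concatenated StarSpace embeddings of entity pairs and `y` would distinguish true relation instances from generated negative examples.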
4 Discussion and conclusion
Recent trends in neuro-symbolic embeddings continue the long-sought quest of the artificial intelligence community to unify two disparate worlds, where reasoning is performed either in a discrete symbolic space or in a continuous vector space. As a community, we are still somewhere along this road, and to date there is still no evidence of a clear way of combining the two approaches. The neuro-symbolic representations based on random walks on RDF data for general biological knowledge as introduced by [9] are an important first development. The methodology allows for leveraging the existing curated and structured biological knowledge (Linked Data), incorporating OWL reasoning, and enabling the inference of hidden links that are implicitly encoded in biological knowledge graphs. However, as our results demonstrate, it is possible to obtain improved classification results for link prediction if we relax the constraints of the multi-relational biological knowledge structure and consider all arcs as part of one semantic relation. Such a relaxation gives rise to faster and more economical generation of neural embeddings, which can be further used in scalable downstream machine learning tasks.

While our results demonstrate excellent prediction performance (all F-measure and ROC AUC scores lie in the 0.92–0.99 range), they also highlight that having very well-structured input data is a core ingredient. Indeed, the biological knowledge graph curated by Alshahrani et al. [9] implicitly encodes significant biological knowledge available to the community, and simple log-linear embeddings coupled with shallow neural networks are enough to obtain very good prediction results for the transductive link prediction problems. Unfortunately, the quest of merging symbolic and continuous representations cannot yet be fulfilled to its advertised limits: as was already mentioned in [9], symbolic inference (OWL-EL reasoning) does not yield significant improvements on link prediction tasks.
Indeed, we managed to obtain very good scores without any deductive completion of the ABox of the knowledge graph. Another important aspect that we implicitly emphasized in our work is the evaluation strategy for the neural embeddings. When dealing with big and rich knowledge graphs, one has to meticulously generate train and test splits that avoid potential leakage of information between the two sets. Failing to do so might lead to models that overfit and are unable to truly perform link prediction. As part of our future work we would like to focus on the creation of different evaluation strategies that test the quality of the neural embeddings and their explainability, and we would like to consider not only transductive link prediction problems but also the more challenging inductive cases.
References
 [1] Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: A collaboratively created graph database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data  SIGMOD ’08, New York, New York, USA, ACM Press (jun 2008) 1247
 [2] Miller, G.A.: Wordnet: a lexical database for english. Commun ACM 38(11) (nov 1995) 39–41
 [3] Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: Advances in Neural Information Processing Systems (2013)
 [4] Nickel, M., Murphy, K., Tresp, V., Gabrilovich, E.: A review of relational machine learning for knowledge graphs. Proc. IEEE 104(1) (jan 2016) 11–33
 [5] Perozzi, B., Al-Rfou, R., Skiena, S.: Deepwalk: Online learning of social representations. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '14, New York, New York, USA, ACM Press (aug 2014) 701–710
 [6] Grover, A., Leskovec, J.: node2vec: Scalable feature learning for networks. KDD 2016 (aug 2016) 855–864
 [7] Ristoski, P., Paulheim, H.: Rdf2vec: Rdf graph embeddings for data mining. In Groth, P., Simperl, E., Gray, A., Sabou, M., Krötzsch, M., Lecue, F., Flöck, F., Gil, Y., eds.: The Semantic Web - ISWC 2016. Volume 9981 of Lecture notes in computer science. Springer International Publishing, Cham (2016) 498–514
 [8] Cochez, M., Ristoski, P., Ponzetto, S.P., Paulheim, H.: Global rdf vector space embeddings. In d'Amato, C., Fernandez, M., Tamma, V., Lecue, F., Cudré-Mauroux, P., Sequeda, J., Lange, C., Heflin, J., eds.: The Semantic Web - ISWC 2017. Volume 10587 of Lecture notes in computer science. Springer International Publishing, Cham (2017) 190–207
 [9] Alshahrani, M., Khan, M.A., Maddouri, O., Kinjo, A.R., Queralt-Rosinach, N., Hoehndorf, R.: Neuro-symbolic representation learning on biological knowledge graphs. Bioinformatics 33(17) (sep 2017) 2723–2730
 [10] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. arXiv (oct 2013)
 [11] Wu, L., Fisch, A., Chopra, S., Adams, K., Bordes, A., Weston, J.: Starspace: Embed all the things! arXiv (sep 2017)
 [12] Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M., Sherlock, G.: Gene ontology: tool for the unification of biology. the gene ontology consortium. Nat Genet 25(1) (may 2000) 25–29
 [13] Köhler, S., Doelken, S.C., Mungall, C.J., Bauer, S., Firth, H.V., Bailleul-Forestier, I., Black, G.C.M., Brown, D.L., Brudno, M., Campbell, J., FitzPatrick, D.R., Eppig, J.T., Jackson, A.P., Freson, K., Girdea, M., Helbig, I., Hurst, J.A., Jähn, J., Jackson, L.G., Kelly, A.M., Ledbetter, D.H., Mansour, S., Martin, C.L., Moss, C., Mumford, A., Ouwehand, W.H., Park, S.M., Riggs, E.R., Scott, R.H., Sisodiya, S., Van Vooren, S., Wapner, R.J., Wilkie, A.O.M., Wright, C.F., Vulto-van Silfhout, A.T., de Leeuw, N., de Vries, B.B.A., Washington, N.L., Smith, C.L., Westerfield, M., Schofield, P., Ruef, B.J., Gkoutos, G.V., Haendel, M., Smedley, D., Lewis, S.E., Robinson, P.N.: The human phenotype ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res 42(Database issue) (jan 2014) D966–74
 [14] Kibbe, W.A., Arze, C., Felix, V., Mitraka, E., Bolton, E., Fu, G., Mungall, C.J., Binder, J.X., Malone, J., Vasant, D., Parkinson, H., Schriml, L.M.: Disease ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data. Nucleic Acids Res 43(Database issue) (jan 2015) D1071–8