Abstract

Motivation: Ontologies are widely used in biology for data annotation, integration, and analysis. In addition to formally structured axioms, ontologies contain meta-data in the form of annotation axioms which provide valuable pieces of information that characterize ontology classes. Annotations commonly used in ontologies include class labels, descriptions, or synonyms. Despite being a rich source of semantic information, the ontology meta-data are generally unexploited by ontology-based analysis methods such as semantic similarity measures.
Results: We propose a novel method, OPA2Vec, to generate vector representations of biological entities in ontologies by combining formal ontology axioms and annotation axioms from the ontology meta-data. We apply a Word2Vec model that has been pre-trained on PubMed abstracts to produce feature vectors from our collected data. We validate our method in two different ways: first, we use the obtained vector representations of proteins as a similarity measure to predict protein-protein interaction (PPI) on two different datasets. Second, we evaluate our method on predicting gene-disease associations based on phenotype similarity by generating vector representations of genes and diseases using a phenotype ontology, and applying the obtained vectors to predict gene-disease associations. These two experiments are just an illustration of the possible applications of our method. OPA2Vec can be used to produce vector representations of any biomedical entity given any type of biomedical ontology.
Availability: https://github.com/bio-ontology-research-group/opa2vec
Contact: robert.hoehndorf@kaust.edu.sa and xin.gao@kaust.edu.sa.

OPA2Vec: combining formal and informal content of biomedical ontologies to improve similarity-based prediction

Fatima Zohra Smaili, Xin Gao and Robert Hoehndorf

1 Introduction

Biological knowledge is spread across many different types of resources. Biomedical ontologies have been highly successful in integrating data across multiple disparate sources by providing an explicit and shared specification of a conceptualization of a domain (Gruber, 1995). Notably, ontologies provide a structured and formal representation of biological knowledge through logical axioms (Hoehndorf et al., 2015b), and they are therefore widely used to capture information that biocurators extract from the literature (Bodenreider, 2008). However, ontologies include not only a formal, logic-based structure but also many pieces of meta-data that are primarily intended for human use, such as labels, descriptions, or synonyms (Smith et al., 2007).

Due to the pervasiveness of ontologies in the life sciences, many applications have been built that exploit various aspects of ontologies for data analysis and to construct predictive models. For example, a wide selection of semantic similarity measures have been developed to exploit information in ontologies (Resnik et al., 1999; Lin et al., 1998; Jiang and Conrath, 1997; Wu and Palmer, 1994; Leacock and Chodorow, 1998; Li et al., 2003; Al-Mubaid and Nguyen, 2006), and semantic similarity measures have successfully been applied to the prediction of protein-protein interactions (Pesquita et al., 2009), gene-disease associations (Köhler et al., 2009), or drug targets (Hoehndorf et al., 2014).

Recently, a set of methods has been developed that characterize heterogeneous graphs through ‘‘embeddings’’, i.e., knowledge graph embedding methods (Bordes et al., 2013; Nickel et al., 2016; Ristoski and Paulheim, 2016). These methods produce feature vectors that encode the knowledge contained in heterogeneous graphs (i.e., knowledge graphs), and they have already been applied successfully in the biomedical domain (Alshahrani et al., 2017). However, ontologies, in particular those in the biomedical domain, cannot easily be represented as graphs (Rodríguez-García and Hoehndorf, 2018) but rather constitute logical theories that are best represented as sets of axioms (Baader et al., 2003).

Recently we developed Onto2Vec, a method that generates feature vectors from the formal logical content of ontologies (Smaili et al., 2018), and we demonstrated that Onto2Vec can outperform existing semantic similarity measures. Here, we extend Onto2Vec to OPA2Vec (Ontologies Plus Annotations to Vectors) to jointly produce vector representations of entities in biomedical ontologies based on both the semantic content of ontologies (i.e., the logical axioms) and the meta-data contained in ontologies as annotation axioms. We combine multiple types of information contained in biomedical ontologies, including asserted and inferred logical axioms, datatype properties, and annotation axioms, to generate a corpus that consists of formal statements, natural language statements, and annotations that relate entities to literals. We then apply a Word2Vec model to generate vector representations for any entity named in the ontology. Using transfer learning, we apply a pre-trained Word2Vec model in OPA2Vec to significantly improve the performance in encoding natural language phrases and statements.

We evaluate OPA2Vec using two different ontologies and applications: first, we use the Gene Ontology (GO) (Ashburner et al., 2000) to produce vector representations of yeast and human proteins and determine their functional similarity and predict interactions between them; second, we evaluate our method on the PhenomeNET ontology (Hoehndorf et al., 2011; Rodríguez-García et al., 2017) to infer vector representations of genes and diseases and use them to predict gene-disease associations. We demonstrate that OPA2Vec can produce task-specific and trainable representations of biological entities that significantly outperform both Onto2Vec and traditional semantic similarity measures in predicting protein-protein interactions and gene-disease associations. OPA2Vec is a generic method which can be applied to any ontology formalized in the Web Ontology Language (OWL) (Grau et al., 2008; W3C OWL Working Group, 2009).

2 Results

2.1 Encoding ontologies plus annotations as vectors

OPA2Vec is an algorithm that uses asserted and inferred logical axioms in ontologies, combines them with annotation axioms (i.e., meta-data associated with entities or axioms in ontologies) and produces dense vector representations of all entities named in an ontology, or entities associated with classes in an ontology. Ontologies formalized in the Web Ontology Language (OWL) (Grau et al., 2008) are based on a Description Logic (Baader et al., 2003). In Description Logics, an ontology is described as the combination of a TBox and an ABox (Horrocks et al., 2006). The TBox is a set of axioms that formally characterize classes (e.g., behavior SubClassOf: ’biological process’), while the ABox contains a set of axioms that characterize instances (e.g., P0AAF6 instanceOf: hasFunction some behavior). The TBox and ABox together are used by the Onto2Vec method (Smaili et al., 2018) to generate dense vector representations; to achieve this goal, Onto2Vec treats asserted or inferred axioms as sentences which form a corpus, and vectors are generated using Word2Vec (Mikolov et al., 2013b, a).

In addition to the TBox and ABox (i.e., to the formal, logical content), ontologies contain a large amount of meta-data in the form of annotation axioms (Hoehndorf et al., 2015b; Smith et al., 2007). Ontology meta-data consist of the set of non-logical annotation axioms that describe different aspects of ontology classes, relations, or instances. For example, most ontologies associate entities with a label, a natural language description, several synonyms, etc. While such meta-data are distinct from the formal content of an ontology and therefore not exploited by methods such as Onto2Vec, they nevertheless provide valuable information about ontology classes, relations, and instances.

OPA2Vec (Ontologies Plus Annotations to Vectors) extends Onto2Vec to combine both the formal content of ontologies and the meta-data expressed as annotation axioms to generate feature vectors for any named entity in an ontology; the vectors encode for both the formal and informal content that characterize and constrain the entities in an ontology.

Our algorithm generates sentences from OWL annotation axioms to form a corpus. For example, from the assertion that an OWL class C has a label L (using rdfs:label in the OWL annotation axiom) we generate the sentence C rdfs:label L (using the complete IRIs for C, L, and rdfs:label). If C has an annotation axiom relating it to multiple words or sentences through an annotation property, such as when providing a textual definition or description for a class (e.g., using the Dublin Core description property), we generate a single sentence in which C is related through that property to the complete value of the annotation property (i.e., we ignore sentence or paragraph delimiters). Some annotation properties do not relate entities to strings but, for example, to dates, numbers, or other literals. For example, an ontology may contain information about the creation date of a class or axiom; we also generate sentences from these annotation axioms and render the value of the annotation property as a string.
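The sentence generation described above can be sketched in a few lines of Python. This is a minimal illustration and not the OPA2Vec implementation: the function names `annotation_sentence` and `build_annotation_corpus` are hypothetical, annotation axioms are assumed to be available as (entity IRI, property IRI, value) triples with plain literal values, and the description text and creation date in the usage example are made-up placeholders.

```python
import datetime
import re

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def annotation_sentence(entity_iri, property_iri, value):
    """Render one annotation axiom as a single corpus sentence.

    Hypothetical helper: the triple layout is an assumption of this sketch.
    """
    # Non-string literals (dates, numbers, ...) are rendered as strings.
    text = str(value)
    # Collapse line breaks and repeated whitespace so that a multi-sentence
    # description becomes one sentence of the corpus.
    text = re.sub(r"\s+", " ", text).strip()
    return f"{entity_iri} {property_iri} {text}"


def build_annotation_corpus(axioms):
    """Turn an iterable of (entity, property, value) triples into sentences."""
    return [annotation_sentence(e, p, v) for e, p, v in axioms]


if __name__ == "__main__":
    behavior = "http://purl.obolibrary.org/obo/GO_0007610"
    toy_axioms = [
        (behavior, RDFS_LABEL, "behavior"),
        # Placeholder description text, not the actual GO definition.
        (behavior, "http://purl.obolibrary.org/obo/IAO_0000115",
         "First sentence of a description.\nSecond sentence."),
        # Placeholder date literal.
        (behavior, "http://www.geneontology.org/formats/oboInOwl#creation_date",
         datetime.date(2018, 1, 1)),
    ]
    for sentence in build_annotation_corpus(toy_axioms):
        print(sentence)
```

Each generated sentence starts with the entity IRI, so the entity shares a context window with the words of its own label or description when the corpus is passed to Word2Vec.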

In OPA2Vec, we combine the corpus generated from the meta-data (i.e., annotation axioms) and the inferred and asserted logical axioms (using the Onto2Vec algorithm). We then apply a Word2Vec skipgram model on the combined corpus to generate vector representations of all entities in the ontology (for technical details, see Section 4.3).

Natural language words used in annotation properties have a pre-defined meaning which cannot easily be derived from their use within an ontology alone. Therefore, we use transfer learning in OPA2Vec to assign a semantics to natural language words based on their use in a large corpus of biomedical text. In particular, we pre-train a Word2Vec model on all PubMed abstracts so that natural language words are assigned a semantics (and vector representation) based on their use in biomedical literature (see Section 4.2). The vocabulary in biomedical literature overlaps with the values of annotation properties (i.e., the natural language words used to describe entities in ontologies) but is disjoint with the vocabulary generated by Onto2Vec (i.e., the IRIs that make up the classes, relations, and instances in an ontology). In OPA2Vec, we therefore update the pre-trained Word2Vec model to generate vectors for the entities in the ontology, and to update the representations of words that overlap between PubMed abstracts and the ontology annotations.

Figure 1 illustrates the OPA2Vec algorithm. The input of the algorithm is an ontology in the OWL format as well as a set of instances and their associations with classes in the ontology (formulated as OWL axioms). The output of OPA2Vec is a vector representation for each entity in the ontology and in the instance set; each vector encodes the logical axioms and meta-data that involve the entity.

Figure 1: The detailed workflow of the feature vector generation pipeline of OPA2Vec

2.2 OPA2Vec performance in predicting interactions between proteins

Ontologies are widely used to analyze biological and biomedical datasets (Hoehndorf et al., 2015b), and one of the main applications of ontologies is the computation of semantic similarity (Pesquita et al., 2009). As OPA2Vec combines logical axioms and annotation axioms into single vector representations, we expect that we can obtain more accurate feature vectors for biological entities than using the ontology structure alone, and that we can use this to improve the computation of semantic similarity.

To evaluate our hypothesis and demonstrate the potential of using OPA2Vec, we used the GO ontology as a case study (see Section 4.1). We generated a knowledge base using GO, and added either human proteins or yeast proteins as instances. We related each protein to its functions by asserting that a protein with function F is an instance of the class has-function some F. We applied OPA2Vec to these two knowledge bases (one including human proteins and the other yeast proteins) and generated vector representations for each protein and ontology class. We then used these vector representations to predict interactions between proteins as characterized in the STRING database (Szklarczyk et al., 2017) by calculating the cosine similarity between each pair of protein vectors and using the obtained value as a prediction score for whether two proteins interact or not. To further improve our prediction performance, we used a neural network model to learn a similarity measure between two feature vectors that is predictive of protein-protein interactions (Smaili et al., 2018). Figure 2 shows the ROC curves and AUC values obtained for OPA2Vec, and the comparison results against Onto2Vec and Resnik’s semantic similarity measure (Resnik et al., 1999) with the Best Match Average strategy (Pesquita et al., 2009) for human and yeast. The workflow we followed to predict protein-protein interactions using OPA2Vec is illustrated in Figure 3. We found that OPA2Vec significantly improves the performance in predicting interactions between proteins in comparison to both Resnik’s semantic similarity measure and Onto2Vec.

(a) Human
(b) Yeast
Figure 2: ROC curves and AUC values of different methods for PPI prediction for yeast and human. Onto2Vec uses formal ontology axioms and compares vectors through cosine similarity; Onto2Vec(NN) uses a neural network to compare vectors; OPA2Vec is our method and uses formal ontology axioms, entity-class associations and annotation properties from the ontology meta-data (labels, description, synonyms, created_by) with a Word2Vec model pre-trained on PubMed, and compares vectors through cosine similarity; OPA2Vec(No pre-training) uses the same strategy as OPA2Vec but without a pre-trained Word2Vec model; OPA2Vec(NN) is OPA2Vec with a neural network that determines the similarity between two protein vectors; Resnik is a semantic similarity measure.
Figure 3: Workflow for protein-protein interaction (PPI) prediction using OPA2Vec.

To determine the contribution of each annotation property to the performance of OPA2Vec, we restricted the inclusion of annotation properties to each of the following main annotation properties: label (rdfs:label), description (obo:IAO_0000115), synonym (oboInOwl:hasExactSynonym, oboInOwl:hasRelatedSynonym, oboInOwl:hasBroadSynonym, oboInOwl:hasNarrowSynonym), created by (oboInOwl:created_by), creation date (oboInOwl:creation_date), and OBO-namespace (oboInOwl:hasOBONamespace). Figure 4 shows the relative contribution of each of the annotation properties for prediction of protein-protein interactions. We found that the inclusion of the natural language descriptions (obo:IAO_0000115) and the class labels (rdfs:label) results in the highest improvement of performance, while some annotation properties such as creation date or the namespace do not improve prediction. Interestingly, the created_by annotation property adds some minor improvement to the performance, likely due to the fact that the same person would add similar or related classes to the GO, and therefore proteins with functions created by the same person have higher probability to interact.

(a) Human
(b) Yeast
Figure 4: Contribution of each annotation property from the meta-data to the PPI prediction accuracy for human and yeast.

Our analysis shows that annotation properties which describe biological entities in natural language contribute the most to the performance improvements of OPA2Vec over Onto2Vec. In particular the label and description, synonyms and creator (oboInOwl:created_by) properties result in better, more predictive feature vector representations. Therefore, we limited our analysis to the labels, natural language descriptions, synonyms, and creator name from the ontology meta-data in further analysis.

We previously found that supervised training can significantly improve the predictive performance when comparing these vector representations as it has the potential to ‘‘learn’’ custom, task-specific similarity measures (Smaili et al., 2018). Therefore, we followed a similar strategy here and trained a deep neural network (see Section 4.5) to predict whether two proteins interact given two protein vector representations as inputs. We found that this supervised approach further improves the performance of OPA2Vec (Figure 2).

2.3 Evaluating performance in predicting gene-disease associations

As a second use case to evaluate OPA2Vec and demonstrate its utility, we applied our approach to the PhenomeNET ontology (Rodríguez-García et al., 2017) (see Section 4.1). PhenomeNET is a system for prioritizing candidate disease genes based on the phenotype similarity (Hoehndorf et al., 2011) between a disease and a database of genotype–phenotype associations. Phenotypes refer here to concrete developmental, morphological, physiological, or behavioral abnormalities observed in an organism, such as signs and symptoms which make up a disease (Gkoutos et al., 2005, 2017). The main advantage of PhenomeNET is that it includes the PhenomeNET ontology which integrates several species-specific phenotype ontologies; it can therefore be used to compare, for example, phenotypes observed in mouse models and phenotypes associated with human disease (Hoehndorf et al., 2013). We used the PhenomeNET ontology and added mouse genes and human diseases to the knowledge base as instances; we then associated each instance with a set of phenotypes. For genes, we used the phenotypes associated with unconditional, single gene knockouts (i.e., complete loss of function mutations) available from the MGI database (Blake et al., 2017); for diseases, we used the disease-to-phenotype file from the HPO database (Köhler et al., 2017) to associate diseases from the Online Mendelian Inheritance in Man (OMIM) (Amberger et al., 2011) database with their phenotypes. In total, our knowledge base consists of 18,920 genes and 7,154 OMIM diseases.

We applied our OPA2Vec algorithm to the combined knowledge base to generate vector representations of genes and diseases. We included only labels, descriptions, synonyms and creators (created_by) as annotation properties as we found them to contribute most to the performance of OPA2Vec. The corpus generated by OPA2Vec therefore consists of the set of asserted and inferred axioms from the PhenomeNET ontology, the set of annotation axioms involving labels, descriptions, synonyms and creators, and the gene and disease phenotype annotations.

We then computed the pairwise cosine similarity between gene vectors and disease vectors, and we trained a neural network in a supervised manner to predict gene-disease associations. We evaluated our results using two datasets of gene-disease associations provided by the MGI database, one containing human disease genes and another containing mouse models of human diseases. Figure 5 shows the ROC curves and AUC values for gene-disease prediction performance of each approach on the human disease genes and mouse models. We compared the obtained results to Resnik similarity and Onto2Vec, and found that OPA2Vec outperforms both Resnik similarity and Onto2Vec in both evaluation sets.

3 Conclusion

We developed the OPA2Vec method to produce vector representations for biological entities in ontologies based on the formal logical content in ontologies combined with the meta-data and natural language descriptions of entities in ontologies. We applied OPA2Vec to two ontologies, the GO and PhenomeNET, and we demonstrated that OPA2Vec can significantly improve predictive performance in applications that rely on the computation of semantic similarity. We also evaluated the individual contributions of each ontology annotation property to the performance of OPA2Vec-generated vectors. Our results illustrate that the annotation properties that describe details about an ontology concept in natural language, in particular the labels and descriptions, contribute most to the feature vectors. We could show that transfer learning, i.e., assigning ‘‘meaning’’ to words by pre-training a Word2Vec model on a large corpus of biomedical literature abstracts, could further significantly improve OPA2Vec performance in our two applications (prediction of protein-protein interactions and prediction of gene-disease associations).

OPA2Vec can comprehensively encode for information in ontologies. Our method is also based on accepted standards for encoding ontologies, in particular the Web Ontology Language (OWL), and has the potential to include or exclude any kind of annotation property in the generation of its features. OPA2Vec also exploits major developments in the biomedical ontologies community: the use of ontologies as community standards, and inclusion of both human- and machine-readable information in ontologies as standard requirements for publishing ontologies (Smith et al., 2007; Matentzoglu et al., 2018). We therefore believe that OPA2Vec has the potential to become a highly useful, standard analysis tool in the biomedical domain, supporting any application in which ontologies are being used.

(a) Human
(b) Mouse
Figure 5: ROC curves and AUC values for gene-disease association prediction for different methods for human and mouse.

4 Methods

4.1 Ontology and annotation resources

We downloaded the Gene Ontology (GO) (Ashburner et al., 2000) in OWL format from http://www.geneontology.org/ontology/ on September 13, 2017. We downloaded the GO protein annotations from the UniProt-GOA website (http://www.ebi.ac.uk/GOA) on September 26, 2017. We removed all annotations with evidence code IEA as well as ND. For validation, we used the STRING database (Szklarczyk et al., 2017) to obtain protein-protein interaction (PPI) data for human (Homo sapiens) and yeast (Saccharomyces cerevisiae), downloaded on September 16, 2017. The yeast PPI network contains 2,007,135 interactions with 6,392 unique proteins, while the human PPI network contains 11,353,057 interactions for 19,577 unique proteins.

We downloaded the PhenomeNET ontology (Hoehndorf et al., 2011; Rodríguez-García et al., 2017) in OWL format from the AberOWL repository http://aber-owl.net (Hoehndorf et al., 2015a) on February 21, 2018. We downloaded the mouse phenotype annotations from the Mouse Genome Informatics (MGI) database http://www.informatics.jax.org/ (Smith and Eppig, 2015) on February 21, 2018, and obtained a total of 302,013 unique mouse phenotype annotations. We obtained the disease to human phenotype annotations on February 21, 2018 from the Human Phenotype Ontology (HPO) database http://human-phenotype-ontology.github.io/ (Robinson et al., 2008). We downloaded only the OMIM disease to human phenotype annotations, which resulted in a total of 78,208 unique disease-phenotype associations. To validate the gene-disease association prediction, we used the MGI_DO.rpt file from the MGI database. This file contains 9,506 mouse gene-OMIM disease associations and 13,854 human gene-OMIM disease associations. To map mouse genes to human genes we used the HMD_HumanPhenotype.rpt file from the MGI database.

To process our ontologies (GO and PhenomeNET), we used the OWL API 4.2.6 (Horridge and Bechhofer, 2011) and the Elk OWL reasoner (Kazakov et al., 2012).

4.2 PubMed

We retrieved the entire collection of article abstracts in the MEDLINE format from the PubMed database https://www.ncbi.nlm.nih.gov/pubmed/ on February 6, 2018. The total number of abstracts collected is 28,189,045. For each abstract, we removed the meta-data (publication date, journal, authors, PMID, etc.), and only kept the title of the article and the text of the abstract.

4.3 Word2Vec

We used the ontologies, the entity annotations, and the PubMed abstracts as text corpora. To process this text data we used Word2Vec (Mikolov et al., 2013b, a). Word2Vec is a machine learning model based on neural networks that generates vector representations of words in a text. Word2Vec is optimized in such a way that the vector representations of words with similar contexts tend to be similar. Word2Vec is available in two different models: the continuous bag of words (CBOW) model and the skip-gram model. In this work, we opted for the skip-gram model, which has the advantage over the CBOW model of creating better quality vector representations of words that are infrequent in the corpus. This advantage is useful in our case since the biological entities we want representations for do not necessarily occur frequently in our text corpora. We pre-trained the Word2Vec model on the set of PubMed abstracts and saved the obtained model, which we then retrained on the ontology studied (the GO ontology or the PhenomeNET ontology). We used grid search to optimize the parameters of the skip-gram model. We used the same parameters to train Word2Vec on the PubMed data set and on the ontology data sets, except for the minimum word frequency (words with a frequency below this value are ignored), which has value 25 for the PubMed model but was changed to 1 before training on the ontology corpus. The parameters we chose are shown in Table 1.

Parameter definition | Default value
Choice of training algorithm (1 = skip-gram, 0 = CBOW) | 1
Dimension of the obtained vectors | 200
Minimum frequency; words with frequency lower than this value will be ignored | 1
Maximum distance between the current and the predicted word | 5
Number of iterations | 5
Whether negative sampling will be used and how many ‘‘noise words’’ would be drawn | 5

Table 1: Parameters used for training the Word2Vec model.
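To make the roles of the window size and the minimum word frequency concrete, the following sketch shows how a skip-gram model derives its (current word, context word) training pairs from a sentence and how a frequency threshold prunes the vocabulary. It is a simplified illustration in plain Python, not the Word2Vec implementation used in this work: the function names are hypothetical, and the training of the vectors themselves, including negative sampling, is left out.

```python
from collections import Counter


def filter_vocabulary(sentences, min_count=1):
    """Keep only words that occur at least `min_count` times in the corpus."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    return {word for word, count in counts.items() if count >= min_count}


def skipgram_pairs(sentence, window=5, vocabulary=None):
    """List the (current word, context word) pairs within `window` positions.

    Words outside `vocabulary` (if given) are dropped before pairing.
    """
    tokens = sentence.split()
    if vocabulary is not None:
        tokens = [t for t in tokens if t in vocabulary]
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


if __name__ == "__main__":
    corpus = ["GO_1 SubClassOf GO_2", "GO_1 label behavior"]
    vocab = filter_vocabulary(corpus, min_count=1)
    print(skipgram_pairs(corpus[0], window=5, vocabulary=vocab))
```

With a minimum frequency of 1, every ontology entity stays in the vocabulary even if it occurs in a single axiom, which is consistent with lowering the threshold before training on the ontology corpus.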

4.4 Similarity

4.4.1 Cosine Similarity

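To calculate similarity between the vectors produced by Word2Vec, we used the cosine similarity, which measures the cosine of the angle between two vectors. The cosine similarity between two vectors $A$ and $B$ is calculated as

$$\mathrm{sim}_{\cos}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} \qquad (1)$$

where $A \cdot B$ is the dot product of $A$ and $B$, and $\lVert \cdot \rVert$ denotes the Euclidean norm.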
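As a small worked illustration of the cosine similarity described above, the following plain-Python sketch computes the score for two vectors. The function name `cosine_similarity` and the error handling for zero vectors are choices of this sketch, not part of the original method description.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # The angle is undefined for a zero vector.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy 3-dimensional vectors; the vectors in this work have 200 dimensions.
    print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 0.0]))
```

Because the score depends only on the angle between the vectors, it is insensitive to vector length, so two entities can be compared even if their vectors differ in magnitude.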

4.4.2 Semantic similarity

The Resnik semantic similarity measure (Resnik et al., 1999) is one of the most widely used semantic similarity measures for ontologies. This measure is based on the notion of information content, which quantifies the specificity of a given concept (term) in the ontology. The information content of a concept $c$ is commonly defined as the negative log likelihood, $\mathrm{IC}(c) = -\log p(c)$, where $p(c)$ is the probability of encountering an instance of concept $c$. Defining information content in this way makes intuitive sense: the higher the probability of a concept, the more abstract the concept is and therefore the lower its information content. Given this definition of information content, the Resnik similarity between two concepts $c_1$ and $c_2$ is formally defined as:

$$\mathrm{sim}_{\mathrm{Resnik}}(c_1, c_2) = \mathrm{IC}(c_{\mathrm{MICA}}) = -\log p(c_{\mathrm{MICA}}) \qquad (2)$$

where $c_{\mathrm{MICA}}$ is the most informative common ancestor of $c_1$ and $c_2$ in the ontology hierarchy, defined as the common ancestor with the highest information content value. The Resnik similarity measure not only has the advantage of being conceptually simple, but it also overcomes the limitation of assuming that all relations represent uniform distances, since in real ontologies the value of an edge may vary.
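The definition above can be illustrated on a toy hierarchy. The sketch below is a minimal plain-Python example under several assumptions: the hierarchy is given as a child-to-parents dictionary, the concept probabilities are supplied directly rather than estimated from annotations, each concept counts as its own ancestor, and all names (`ancestors`, `information_content`, `resnik_similarity`, the toy concepts) are hypothetical.

```python
import math


def ancestors(concept, parents):
    """Return the concept together with all of its ancestors."""
    seen = {concept}
    stack = [concept]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(concept, prob):
    """Negative log probability of encountering an instance of the concept."""
    return -math.log(prob[concept])


def resnik_similarity(c1, c2, parents, prob):
    """Information content of the most informative common ancestor."""
    common = ancestors(c1, parents) & ancestors(c2, parents)
    if not common:
        return 0.0  # assumption of this sketch: unrelated concepts score 0
    return max(information_content(c, prob) for c in common)


if __name__ == "__main__":
    # Toy hierarchy: B and C are children of A; A and D are children of root.
    parents = {"A": ["root"], "B": ["A"], "C": ["A"], "D": ["root"]}
    prob = {"root": 1.0, "A": 0.5, "B": 0.2, "C": 0.1, "D": 0.5}
    print(resnik_similarity("B", "C", parents, prob))  # IC of A
    print(resnik_similarity("B", "D", parents, prob))  # IC of root
```

In this toy hierarchy, B and C share the ancestor A, whose probability is 0.5, while B and D share only the root, whose information content is zero.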

Biological entities can have several concept annotations within an ontology. For instance, as a protein can be involved in different biological processes and can carry several molecular functions, it can be annotated with more than one GO term. Therefore, to calculate semantic similarity between a pair of proteins, or a pair of any biological entities, it is necessary to properly aggregate the similarity between the concepts that they are respectively annotated with. One possible way to achieve this is to calculate the Best Match Average (BMA), which estimates the average similarity between the best matching terms of the two entities (Azuaje et al., 2005). For two biological entities $e_1$ and $e_2$, the BMA is given by:

$$\mathrm{BMA}(e_1, e_2) = \frac{1}{2} \left( \frac{1}{|S_1|} \sum_{c_1 \in S_1} \max_{c_2 \in S_2} \mathrm{sim}(c_1, c_2) + \frac{1}{|S_2|} \sum_{c_2 \in S_2} \max_{c_1 \in S_1} \mathrm{sim}(c_1, c_2) \right) \qquad (3)$$

where $S_1$ is the set of ontology concepts that $e_1$ is annotated with, $S_2$ is the set of concepts that $e_2$ is annotated with, and $\mathrm{sim}(c_1, c_2)$ is the similarity value between concept $c_1$ and concept $c_2$, which could have been calculated using the Resnik similarity or any other semantic similarity measure.
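The Best Match Average can be sketched as follows. This is an illustrative plain-Python version in which the function name `best_match_average` is hypothetical and the concept-level similarity is passed in as a function, so that Resnik similarity or any other measure can be plugged in; the toy similarity in the usage example is a placeholder.

```python
def best_match_average(s1, s2, sim):
    """Average of best-match similarities between two annotation sets.

    `s1` and `s2` are the concept sets of the two entities and `sim` is any
    function returning a similarity value for a pair of concepts.
    """
    if not s1 or not s2:
        raise ValueError("both entities need at least one annotation")
    # For each concept of the first entity, its best match in the second.
    best_1 = sum(max(sim(c1, c2) for c2 in s2) for c1 in s1) / len(s1)
    # For each concept of the second entity, its best match in the first.
    best_2 = sum(max(sim(c1, c2) for c1 in s1) for c2 in s2) / len(s2)
    return (best_1 + best_2) / 2


if __name__ == "__main__":
    # Placeholder similarity: 1 for identical concepts, 0 otherwise.
    def toy_sim(a, b):
        return 1.0 if a == b else 0.0

    print(best_match_average({"x", "y"}, {"x"}, toy_sim))
```

Averaging in both directions keeps the measure symmetric even when the two entities have different numbers of annotations.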

4.5 Supervised Learning

To improve our PPI prediction and gene-disease association prediction performance, we trained a neural network as our prediction model. For PPI prediction, we used 1,015 proteins from the yeast data set for training and 677 randomly selected proteins for testing, while for the human data set we used 2,263 proteins for training and 1,509 for testing. The positive pairs were all those reported in the STRING database, while the negative pairs were randomly sub-sampled from all the pairs not occurring in STRING, so that the positive and negative sets have equal cardinality in both the training and the testing datasets.
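The construction of a balanced set of positive and negative pairs can be sketched as below. This is a toy illustration in plain Python, assuming that interactions are unordered pairs and that all candidate pairs fit in memory; the function name `sample_balanced_pairs`, the fixed random seed, and the protein identifiers are hypothetical. Enumerating every pair is quadratic in the number of proteins, so a real data set would likely call for a different sampling strategy.

```python
import random
from itertools import combinations


def sample_balanced_pairs(proteins, positives, seed=0):
    """Return (positive pairs, equally many randomly sampled negative pairs)."""
    positive_set = {frozenset(pair) for pair in positives}
    # Every unordered pair that is not a known interaction is a candidate.
    candidates = [
        pair for pair in combinations(sorted(proteins), 2)
        if frozenset(pair) not in positive_set
    ]
    if len(candidates) < len(positive_set):
        raise ValueError("not enough non-interacting pairs to sample from")
    rng = random.Random(seed)
    negatives = rng.sample(candidates, len(positive_set))
    return sorted(tuple(sorted(pair)) for pair in positive_set), negatives


if __name__ == "__main__":
    proteins = ["P1", "P2", "P3", "P4", "P5"]
    known_interactions = [("P1", "P2"), ("P3", "P4")]
    pos, neg = sample_balanced_pairs(proteins, known_interactions)
    print(pos, neg)
```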

For the mouse gene-disease association prediction, 6,710 gene-disease associations were used for training and 2,876 for testing, while for the human gene-disease association prediction, 9,698 associations were used for training and 4,196 for testing. The positive gene-disease association pairs were obtained from the MGI_DO.rpt file; all other associations were considered to be negative.

We chose our neural network to be a feed-forward network with four layers: the first layer contains 400 input units; the second and third layers are hidden layers which contain 800 and 200 neurons, respectively; and the fourth layer contains one output neuron. We optimized parameters using a limited manual search based on best practice guidelines (Hunter et al., 2012). We optimized the ANN using binary cross entropy as the loss function.
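The layer layout and loss function described above can be sketched as a forward pass in plain Python. Several details are assumptions of this sketch and are not stated in the text: the 400 inputs are taken to be the concatenation of two 200-dimensional vectors, the hidden layers use ReLU activations, the output neuron uses a sigmoid, and the weights are randomly initialized instead of trained. All function names are hypothetical.

```python
import math
import random

LAYER_SIZES = (400, 800, 200, 1)  # input, two hidden layers, output


def init_network(sizes=LAYER_SIZES, seed=0):
    """Create randomly initialized (weights, biases) for each layer."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers, x):
    """Forward pass: ReLU on hidden layers, sigmoid on the output (assumed)."""
    activation = list(x)
    for index, (weights, biases) in enumerate(layers):
        pre = [sum(w * a for w, a in zip(row, activation)) + b
               for row, b in zip(weights, biases)]
        if index == len(layers) - 1:
            activation = [1.0 / (1.0 + math.exp(-z)) for z in pre]
        else:
            activation = [max(0.0, z) for z in pre]
    return activation[0]


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Loss for one example with label 0 or 1 and predicted probability."""
    y_pred = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1.0 - y_pred))


if __name__ == "__main__":
    rng = random.Random(1)
    vec_a = [rng.gauss(0.0, 1.0) for _ in range(200)]  # toy entity vector
    vec_b = [rng.gauss(0.0, 1.0) for _ in range(200)]  # toy entity vector
    score = forward(init_network(), vec_a + vec_b)
    print(score, binary_cross_entropy(1, score))
```

In practice such a network would be trained with a deep learning library; the sketch only shows how a pair of vectors is mapped to a single score between 0 and 1 and how that score is penalized against a binary label.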

4.6 Evaluation using ROC curve and AUC

To evaluate our PPI and gene-disease prediction, we used the ROC (Receiver Operating Characteristic) curve, which is a widely used evaluation method to assess the performance of prediction and classification models. It plots the true-positive rate (TPR or sensitivity), defined as $TPR = TP / (TP + FN)$, against the false-positive rate (FPR, or 1 − specificity), defined as $FPR = FP / (FP + TN)$, where $TP$ is the number of true positives, $FP$ is the number of false positives, $TN$ is the number of true negatives, and $FN$ is the number of false negatives. Ideally, a perfect classification model would have a ROC curve that connects the points $(0, 0)$, $(0, 1)$ and $(1, 1)$ (Yin and Vogel, 2017). Generally, the closer the ROC curve bends towards this ‘‘perfect curve’’ the better the model is. In the context of this work, the ROC curve is used to evaluate the PPI prediction of our method as well as competing methods. In this context, the $TP$ value is the number of protein pairs occurring in STRING, regardless of their STRING confidence score, which have been predicted as interacting. The $FP$ value is the number of protein pairs which have been predicted as interacting but do not appear in the STRING network, and the $TN$ value is the number of protein pairs predicted as non-interacting which do not occur in the STRING database.

In most cases, the ROC curves of different methods overlap, which makes visual inspection of the ROC curves insufficient for a formal comparison between methods (Yin and Vogel, 2017). Thus there is a need for a quantitative measure that summarizes a ROC curve and allows a more formal comparison between different methods. The most popular such measure is the area under the ROC curve (AUC), which is the integral of the ROC curve over the entire FPR axis (Yin and Vogel, 2017). In this work, the AUC has been used along with the ROC curve to evaluate the PPI prediction performance.
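The evaluation described in this section can be sketched in plain Python as follows. This is an illustrative implementation rather than the evaluation code used in this work: the function names are hypothetical, labels are assumed to be 1 for interacting (or associated) pairs and 0 otherwise, and the AUC is approximated with the trapezoidal rule over the ROC points.

```python
def roc_curve(labels, scores):
    """Return the ROC points (FPR, TPR) obtained by sweeping the threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    # Rank the examples from the highest to the lowest prediction score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    previous_score = None
    for score, label in ranked:
        # Emit a point each time the score changes, so ties share one point.
        if previous_score is not None and score != previous_score:
            points.append((fp / negatives, tp / positives))
        if label:
            tp += 1
        else:
            fp += 1
        previous_score = score
    points.append((fp / negatives, tp / positives))  # always (1.0, 1.0)
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    toy_labels = [1, 0, 1, 0]
    toy_scores = [0.9, 0.8, 0.7, 0.6]  # e.g. cosine similarities
    print(auc(roc_curve(toy_labels, toy_scores)))
```

A perfect ranking, in which every positive pair scores higher than every negative pair, yields an AUC of 1, while a ranking that cannot separate the two groups yields a value around 0.5.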



Funding

The research reported in this publication was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. FCC/1/1976-04, FCC/1/1976-06, URF/1/2602-01, URF/1/3007-01, URF/1/3412-01, URF/1/3450-01 and URF/1/3454-01.

References

  • Al-Mubaid and Nguyen (2006) Al-Mubaid, H. and Nguyen, H. A. (2006). A cluster-based approach for semantic similarity in the biomedical domain. In Engineering in Medicine and Biology Society, 2006. EMBS’06. 28th Annual International Conference of the IEEE, pages 2713–2717. IEEE.
  • Alshahrani et al. (2017) Alshahrani, M. et al. (2017). Neuro-symbolic representation learning on biological knowledge graphs. Bioinformatics, 33(17), 2723–2730.
  • Amberger et al. (2011) Amberger, J. et al. (2011). A new face and new challenges for Online Mendelian Inheritance in Man (OMIM). Hum Mutat, 32, 564–567.
  • Ashburner et al. (2000) Ashburner, M. et al. (2000). Gene ontology: tool for the unification of biology. Nature genetics, 25(1), 25–29.
  • Azuaje et al. (2005) Azuaje, F. et al. (2005). Ontology-driven similarity approaches to supporting gene functional assessment. In Proceedings of the ISMB’2005 SIG meeting on Bio-ontologies, pages 9–10.
  • Baader et al. (2003) Baader, F. et al. (2003). The Description Logic Handbook: Theory, Implementation and Applications. Cambridge University Press.
  • Blake et al. (2017) Blake, J. A. et al. (2017). Mouse genome database (mgd)-2017: community knowledge resource for the laboratory mouse. Nucleic Acids Research, 45(D1), D723–D729.
  • Bodenreider (2008) Bodenreider, O. (2008). Biomedical ontologies in action: role in knowledge management, data integration and decision support. Yearbook of medical informatics, page 67.
  • Bordes et al. (2013) Bordes, A. et al. (2013). Translating embeddings for modeling multi-relational data. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2787–2795. Curran Associates, Inc.
  • Gkoutos et al. (2005) Gkoutos, G. V. et al. (2005). Using ontologies to describe mouse phenotypes. Genome biology, 6(1), R5.
  • Gkoutos et al. (2017) Gkoutos, G. V. et al. (2017). The anatomy of phenotype ontologies: principles, properties and applications. Briefings in Bioinformatics.
  • Grau et al. (2008) Grau, B. et al. (2008). Owl 2: The next step for owl. Web Semantics: Science, Services and Agents on the World Wide Web, 6(4), 309–322.
  • Gruber (1995) Gruber, T. R. (1995). Toward principles for the design of ontologies used for knowledge sharing. International Journal of Human-Computer Studies, 43(5-6).
  • Hoehndorf et al. (2013) Hoehndorf, R. et al. (2013). An integrative, translational approach to understanding rare and orphan genetically based diseases. Interface Focus, 3(2), 20120055.
  • Hoehndorf et al. (2011) Hoehndorf, R. et al. (2011). Phenomenet: a whole-phenome approach to disease gene discovery. Nucleic acids research, 39(18), e119–e119.
  • Hoehndorf et al. (2014) Hoehndorf, R. et al. (2014). Mouse model phenotypes provide information about human drug targets. Bioinformatics, 30(5), 719–725.
  • Hoehndorf et al. (2015a) Hoehndorf, R. et al. (2015a). Aber-owl: a framework for ontology-based data access in biology. BMC bioinformatics, 16(1), 26.
  • Hoehndorf et al. (2015b) Hoehndorf, R. et al. (2015b). The role of ontologies in biological and biomedical research: a functional perspective. Briefings in Bioinformatics.
  • Horridge and Bechhofer (2011) Horridge, M. and Bechhofer, S. (2011). The owl api: A java api for owl ontologies. Semantic Web, 2(1), 11–21.
  • Horrocks et al. (2006) Horrocks, I. et al. (2006). The even more irresistible sroiq. In P. Doherty, J. Mylopoulos, and C. A. Welty, editors, KR, pages 57–67. AAAI Press.
  • Hunter et al. (2012) Hunter, D. et al. (2012). Selection of proper neural network sizes and architectures—a comparative study. IEEE Transactions on Industrial Informatics, 8(2), 228–240.
  • Jiang and Conrath (1997) Jiang, J. J. and Conrath, D. W. (1997). Semantic similarity based on corpus statistics and lexical taxonomy. arXiv preprint cmp-lg/9709008.
  • Kazakov et al. (2012) Kazakov, Y. et al. (2012). Elk reasoner: Architecture and evaluation. In ORE.
  • Köhler et al. (2009) Köhler, S. et al. (2009). Clinical diagnostics in human genetics with semantic similarity searches in ontologies. The American Journal of Human Genetics, 85(4), 457 – 464.
  • Köhler et al. (2017) Köhler, S. et al. (2017). The human phenotype ontology in 2017. Nucleic Acids Research, 45(D1), D865–D876.
  • Leacock and Chodorow (1998) Leacock, C. and Chodorow, M. (1998). Combining local context and wordnet similarity for word sense identification. WordNet: An electronic lexical database, 49(2), 265–283.
  • Li et al. (2003) Li, Y. et al. (2003). An approach for measuring semantic similarity between words using multiple information sources. IEEE Transactions on knowledge and data engineering, 15(4), 871–882.
  • Lin et al. (1998) Lin, D. et al. (1998). An information-theoretic definition of similarity. In Icml, volume 98, pages 296–304.
  • Matentzoglu et al. (2018) Matentzoglu, N. et al. (2018). Miro: guidelines for minimum information for the reporting of an ontology. Journal of Biomedical Semantics, 9(1), 6.
  • Mikolov et al. (2013a) Mikolov, T. et al. (2013a). Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546.
  • Mikolov et al. (2013b) Mikolov, T. et al. (2013b). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.
  • Nickel et al. (2016) Nickel, M. et al. (2016). Holographic embeddings of knowledge graphs. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pages 1955–1961. AAAI Press.
  • Pesquita et al. (2009) Pesquita, C. et al. (2009). Semantic similarity in biomedical ontologies. PLoS Comput Biol, 5(7), e1000443.
  • Resnik et al. (1999) Resnik, P. et al. (1999). Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. J. Artif. Intell. Res.(JAIR), 11, 95–130.
  • Ristoski and Paulheim (2016) Ristoski, P. and Paulheim, H. (2016). Rdf2vec: Rdf graph embeddings for data mining. In International Semantic Web Conference, pages 498–514. Springer.
  • Robinson et al. (2008) Robinson, P. N. et al. (2008). The human phenotype ontology: a tool for annotating and analyzing human hereditary disease. The American Journal of Human Genetics, 83(5), 610–615.
  • Rodríguez-García and Hoehndorf (2018) Rodríguez-García, M. Á. and Hoehndorf, R. (2018). Inferring ontology graph structures using owl reasoning. BMC Bioinformatics, 19(1), 7.
  • Rodríguez-García et al. (2017) Rodríguez-García, M. Á. et al. (2017). Integrating phenotype ontologies with phenomenet. Journal of biomedical semantics, 8(1), 58.
  • Smaili et al. (2018) Smaili, F. Z. et al. (2018). Onto2vec: joint vector-based representation of biological entities and their ontology-based annotations. arXiv preprint arXiv:1802.00864.
  • Smith et al. (2007) Smith, B. et al. (2007). The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotech, 25(11), 1251–1255.
  • Smith and Eppig (2015) Smith, C. L. and Eppig, J. T. (2015). Expanding the mammalian phenotype ontology to support automated exchange of high throughput mouse phenotyping data generated by large-scale mouse knockout screens. Journal of biomedical semantics, 6(1), 11.
  • Szklarczyk et al. (2017) Szklarczyk, D. et al. (2017). The string database in 2017: quality-controlled protein–protein association networks, made broadly accessible. Nucleic Acids Research, 45(D1), D362–D368.
  • W3C OWL Working Group (2009) W3C OWL Working Group (2009). Owl 2 web ontology language: Document overview. Technical report, W3C. http://www.w3.org/TR/owl2-overview/.
  • Wu and Palmer (1994) Wu, Z. and Palmer, M. (1994). Verbs semantics and lexical selection. In Proceedings of the 32nd annual meeting on Association for Computational Linguistics, pages 133–138. Association for Computational Linguistics.
  • Yin and Vogel (2017) Yin, J. and Vogel, R. L. (2017). Using the roc curve to measure association and evaluate prediction accuracy for a binary outcome. Biometrics & Biostatistics International Journal, 5(3), 1.