New metrics for learning and inference on sets, ontologies, and functions

Ruiyu Yang Yuxiang Jiang Department of Computer Science and Informatics, Indiana University, Bloomington, Indiana, USA Matthew W. Hahn Elizabeth A. Housworth Department of Mathematics, Indiana University, Bloomington, Indiana, USA Predrag Radivojac Department of Computer Science and Informatics, Indiana University, Bloomington, Indiana, USA
Abstract

We propose new metrics on sets, ontologies, and functions that can be used in various stages of probabilistic modeling, including exploratory data analysis, learning, inference, and result interpretation. These new functions unify and generalize some of the popular metrics on sets and functions, such as the Jaccard and bag distances on sets and Marczewski-Steinhaus distance on functions. We then introduce information-theoretic metrics on directed acyclic graphs drawn independently according to a fixed probability distribution and show how they can be used to calculate similarity between class labels for the objects with hierarchical output spaces (e.g., protein function). Finally, we provide evidence that the proposed metrics are useful by clustering species based solely on functional annotations available for subsets of their genes. The functional trees resemble evolutionary trees obtained by the phylogenetic analysis of their genomes.

1 Introduction

The development of domain-specific machine learning algorithms inevitably requires choices regarding data pre-processing, data representation, model selection and training, and evaluation strategies, among other areas. One such requirement is the selection of similarity or distance functions that play important roles in many stages of data analysis and processing. In a supervised scenario, for example, distance-based algorithms such as the k-nearest neighbor or kernel machines critically depend on the selection of distance functions. Similarly, the entire class of hierarchical clustering techniques relies on the selection of distances between data points that are sensible for a particular application domain. A number of other algorithms rely on the existence of metric spaces implicitly.

While some approaches to data analysis and learning do not require all properties of metrics [Margareta], satisfying these properties is desirable. In data analysis, for example, calculating an average pairwise distance between different groups of data points may present difficulties in interpreting results if well-understood and intuitive geometric properties are violated. Similarly, in learning algorithms, the properties of a metric are important in proving the convergence of algorithms or in achieving computational speed-ups, and consequently better inference outcomes [Elkan, DBScan, Simovici].

In this work we present new classes of metrics on sets and functions and frame several well-known metrics as their special cases. We then show how the new distance functions can be adapted to give information-theoretic metric spaces on the set of functional annotations of biological macromolecules. Finally, we carry out experiments to show that we can (approximately) reconstruct the phylogenetic relationships between species using these metrics based solely on the functional annotations currently available for their genes.

2 Background

2.1 Metrics

Metrics are a mathematical formalization of the everyday notion of distance. Given a non-empty set $X$, a function $d: X \times X \to \mathbb{R}$ is called a distance if

  • $d(a, b) \geq 0$ (non-negativity)

  • $d(a, a) = 0$ (reflexivity)

  • $d(a, b) = d(b, a)$ (symmetry)

for all $a, b \in X$. A non-empty set endowed with a distance function is referred to as a distance space [book]. A distance function is called a metric if

  • $d(a, b) = 0$ iff $a = b$ (identity of indiscernibles)

  • $d(a, b) \leq d(a, c) + d(c, b)$ (triangle inequality)

for all $a, b, c \in X$. A non-empty set endowed with a metric is referred to as a metric space [book].
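The axioms above can be checked by brute force on a finite sample of points. The following is a minimal sketch; the helper name `metric_violations` and the tolerance are illustrative assumptions, and passing the check on a sample does not prove that a function is a metric.

```python
from itertools import product

def metric_violations(points, d, tol=1e-12):
    """Return the names of the metric axioms that d violates on the given points."""
    violated = set()
    for a, b in product(points, repeat=2):
        if d(a, b) < -tol:
            violated.add("non-negativity")
        if abs(d(a, b) - d(b, a)) > tol:
            violated.add("symmetry")
        # d(a, b) must be zero exactly when a and b are the same point.
        if (d(a, b) <= tol) != (a == b):
            violated.add("identity of indiscernibles")
    for a, b, c in product(points, repeat=3):
        if d(a, b) > d(a, c) + d(c, b) + tol:
            violated.add("triangle inequality")
    return violated

# The absolute difference passes; the squared difference fails the triangle inequality.
print(metric_violations([0, 1, 2, 3], lambda a, b: abs(a - b)))
print(metric_violations([0, 1, 2], lambda a, b: (a - b) ** 2))
```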

2.2 Protein Function and its Functional Annotation

Proteins are biological macromolecules comprising more than 50% of the dry weight of living cells and responsible for a wide range of cellular activities. The totality of a protein’s activity under all environmental conditions is referred to as protein function and is determined through a series of biochemical, biological, and/or genetic studies. The results of such experimental work are initially published in free text and later processed by curators who convert them to ontological annotations in order to standardize knowledge representation.

Biomedical ontologies are typically represented as graphs in which the nodes correspond to biological concepts and edges represent the relationships between these concepts [Robinson2011]. While, in principle, there are no restrictions on the types of graphs used for the ontologies, most ontologies incorporate hierarchical organization of trees or directed acyclic graphs. The most frequently used ontology for the description of protein function is the Gene Ontology [Ashburner2000] that consists of three directed acyclic graphs describing different aspects of a protein’s activity. The Molecular Function Ontology (MFO) describes protein function at the biochemical level; the Biological Process Ontology (BPO) provides a more abstract view in terms of the emergent biological processes a protein is involved in; and the Cellular Component Ontology (CCO) describes the location in or outside the cell where the protein carries out its function.

Formally, we consider an ontology to be a directed acyclic graph $O = (V, E)$ with a set of vertices (concepts, terms) $V$ and a set of edges (relational ties) $E \subset V \times V$. In terms of functional annotations, a protein function can be seen as a consistent subgraph $F \subseteq V$ of the larger ontology graph. By saying consistent, we mean that if a vertex $v$ belongs to $F$, then all the ancestors of $v$ up to the root(s) of the ontology must also belong to $F$. Typically, a subgraph corresponding to an experimentally annotated protein contains 10-100 nodes, whereas the ontology graph consists of 1000-10000 nodes. The concept of protein function is illustrated in Figure 1.

Figure 1: Illustration of a protein’s functional annotation using the Biological Process Ontology. The functional annotation graph (blue nodes) contains two leaf nodes, ‘Lateral growth’ and ‘Isotropic cell growth’, that completely determine the consistent subgraph.
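Because a consistent subgraph is closed under ancestors, it is fully determined by its leaf terms. The following is a minimal sketch of that closure on a hypothetical toy ontology; the term names and the `parents` mapping are illustrative assumptions, not Gene Ontology terms.

```python
# Hypothetical toy ontology: each term maps to the list of its parents.
parents = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],
    "d": ["b"],
}

def consistent_closure(terms, parents):
    """Return the given terms together with all of their ancestors."""
    closure, stack = set(), list(terms)
    while stack:
        v = stack.pop()
        if v not in closure:
            closure.add(v)
            stack.extend(parents[v])
    return closure

def is_consistent(subgraph, parents):
    """A subgraph is consistent if every parent of every member is also a member."""
    return all(p in subgraph for v in subgraph for p in parents[v])

annotation = consistent_closure({"c"}, parents)
print(sorted(annotation))
print(is_consistent({"root", "a", "c"}, parents))  # "b", a parent of "c", is missing
```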

2.3 Comparing Protein Functions

Although the concept of protein function can be standardized through the use of ontologies, comparisons between two functional annotations are far from straightforward [Pesquita2009, Guzzi2012]. This is caused by the dependence and hierarchical relationship between the terms and also by the fact that experimental annotations of proteins are generally incomplete and biased. Terms closer to the root of the ontology are usually more general; however, some parts of GO are significantly more refined than others, causing difficulties in using depth as a proxy for the specificity of a particular concept.

Similarity functions between pairs of proteins can be broadly divided into topological and probabilistic. Topological comparisons are usually node-based or edge-based, but can also incorporate the structure of the ontology. Similarity functions such as the Jaccard coefficient and cosine similarity are based on the sets of terms with which the two proteins are annotated. More complex functions incorporate shortest path-based distances [Rada1989], node reachability [Mazandu2012], and others. While simple and generally interpretable, many of these functions do not adequately address the hierarchical nature of the biomedical terms or practical issues such as those related to the unequal specificity of these terms in different parts of the ontology.

Probabilistic or information-theoretic similarity measures, on the other hand, incorporate the structure of the ontology but also assume an underlying probabilistic model for the data, where a database of experimentally annotated proteins is used to estimate parameters of the model. Probabilistic similarity functions are usually related to the semantic similarity proposed by Resnik [Resnik1995]. This measure uses a database of proteins to estimate the probability of every node and then computes the similarity between nodes $u$ and $v$ as $-\log Pr(w)$, where $w$ is the common ancestor of $u$ and $v$ with the lowest probability. The deficiencies of Resnik’s similarity have been widely discussed and have led to several modifications [Jiang1997, Lin1998]. However, the main problem of applying these similarity measures to subgraphs of the ontology containing multiple leaf terms (as shown in Figure 1) has not been resolved in a principled manner. In particular, in comparing two functional annotations that contain multiple leaf terms, one inevitably needs to resort to heuristic techniques such as all-pair averaging, best-match averaging, or simply finding the maximum between all pairs of leaves in the two annotation graphs [Resnik1999, Lord2003, Schlicker2006, Verspoor2006].

3 Methods

In this section we introduce new metrics on sets, ontologies, and functions.

3.1 Metrics on Sets

3.1.1 Unnormalized Metrics on Sets

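Let $X$ be a non-empty set of finite sets drawn from some universe. We define a function $d^p: X \times X \to \mathbb{R}$ as

$$d^p(A, B) = \left(|A \setminus B|^p + |B \setminus A|^p\right)^{1/p} \tag{1}$$

where $|\cdot|$ denotes set cardinality, $A \setminus B = A \cap B^c$, and $p \geq 1$ is a parameter.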

Theorem 3.1.

$(X, d^p)$ is a metric space.

Proof of Theorem 3.1.

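The only property of a metric not obviously satisfied by $d^p$ is the triangle inequality. Given arbitrary sets $A, B, C \in X$, we have

$$\begin{aligned} d^p(A, C) + d^p(C, B) &= \left(|A \setminus C|^p + |C \setminus A|^p\right)^{1/p} + \left(|C \setminus B|^p + |B \setminus C|^p\right)^{1/p} \\ &\geq \left(\left(|A \setminus C| + |C \setminus B|\right)^p + \left(|C \setminus A| + |B \setminus C|\right)^p\right)^{1/p} \\ &\geq \left(|A \setminus B|^p + |B \setminus A|^p\right)^{1/p} = d^p(A, B). \end{aligned}$$

By the Minkowski inequality, the first inequality holds. The second inequality follows from Figure 2, since $A \setminus B \subseteq (A \setminus C) \cup (C \setminus B)$ and $B \setminus A \subseteq (B \setminus C) \cup (C \setminus A)$. ∎

Observe that the symmetric difference distance on sets is a special case of $d^p$ when $p = 1$ [book]. Additionally, the bag distance on sets is a special case of $d^p$ when $p \to \infty$ [book].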
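As an illustration, the unnormalized set metric can be sketched in a few lines, assuming the form $(|A \setminus B|^p + |B \setminus A|^p)^{1/p}$ with the limit $p \to \infty$ taken as the maximum of the two one-sided differences. The function name `set_distance` and the example sets are hypothetical.

```python
import math

def set_distance(a, b, p=1.0):
    """Unnormalized set distance (|A \\ B|^p + |B \\ A|^p)^(1/p) for p >= 1."""
    x, y = len(a - b), len(b - a)
    if math.isinf(p):
        # Limit p -> infinity: the larger of the two one-sided differences.
        return float(max(x, y))
    return (x ** p + y ** p) ** (1.0 / p)

A, B = {1, 2, 3, 4}, {3, 4, 5}
print(set_distance(A, B, p=1))         # size of the symmetric difference
print(set_distance(A, B, p=2))
print(set_distance(A, B, p=math.inf))  # bag distance
```

Here $|A \setminus B| = 2$ and $|B \setminus A| = 1$, so $p = 1$ gives 3 and $p \to \infty$ gives 2.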

3.1.2 Normalized Metrics on Sets

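Let $X$ again be a non-empty set of finite sets drawn from some universe. We define a function $d_N^p: X \times X \to \mathbb{R}$ as

$$d_N^p(A, B) = \frac{\left(|A \setminus B|^p + |B \setminus A|^p\right)^{1/p}}{|A \cup B|} \tag{2}$$

where $|\cdot|$ denotes set cardinality, $A \setminus B = A \cap B^c$, and $p \geq 1$ is a parameter.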

Theorem 3.2.

$(X, d_N^p)$ is a metric space. In addition, $d_N^p: X \times X \to [0, 1]$.

Proof of Theorem 3.2.

As in the unnormalized case, the only property of a metric that does not clearly satisfy is the triangle inequality. Let be arbitrary sets. If at least one of and holds, then the triangle inequality also holds.

Figure 2: The Venn diagram and notation for the cardinality of elements related to three sets $A$, $B$, and $C$.

Without loss of generality, assume that . Let the cardinality of each set be denoted by , , , as shown in Figure 2. Let and . Without loss of generality, assume that .

The second inequality is true due to the Minkowski inequality. The third inequality is true since we subtracted the same nonnegative number from both the numerator and denominator of the fraction with the fraction itself remaining in $[0, 1]$ after the subtraction (the numerator is nonnegative before and after the subtraction). Hence, $d_N^p$ is a metric. It follows that $d_N^p$ is bounded in $[0, 1]$ via the Minkowski inequality.

Observe that the Jaccard distance is a special case of $d_N^p$ when $p = 1$.
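The normalized set metric can be sketched in the same way, assuming the numerator $(|A \setminus B|^p + |B \setminus A|^p)^{1/p}$ divided by $|A \cup B|$. The convention that two empty sets are at distance zero is an assumption of this sketch, as are the function names. The brute-force check over all subsets of a three-element universe is only a sanity check, not a proof.

```python
import math
from itertools import combinations

def normalized_set_distance(a, b, p=1.0):
    """Normalized set distance: (|A \\ B|^p + |B \\ A|^p)^(1/p) / |A u B|."""
    union = len(a | b)
    if union == 0:
        return 0.0  # assumption: two empty sets are at distance zero
    x, y = len(a - b), len(b - a)
    numerator = max(x, y) if math.isinf(p) else (x ** p + y ** p) ** (1.0 / p)
    return numerator / union

def jaccard_distance(a, b):
    return 1.0 - len(a & b) / len(a | b)

def triangle_holds(p, universe=range(3), tol=1e-12):
    """Check the triangle inequality over all triples of subsets of a tiny universe."""
    subsets = [set(c) for r in range(len(universe) + 1)
               for c in combinations(universe, r)]
    d = lambda s, t: normalized_set_distance(s, t, p)
    return all(d(a, b) <= d(a, c) + d(c, b) + tol
               for a in subsets for b in subsets for c in subsets)

A, B = {1, 2, 3, 4}, {3, 4, 5}
print(normalized_set_distance(A, B, p=1), jaccard_distance(A, B))
print(all(triangle_holds(p) for p in (1, 2, math.inf)))
```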

3.1.3 Relationship to Minkowski distance

Although the new metrics have a similar form to the Minkowski distance on binary set representations, they are generally different. Take for example and from a universe of elements. A sparse set representation results in the following encoding: and . The Minkowski distance of order between and is defined as

and . Substituting the numbers into the expressions above gives and for ; and for , etc. It is worth noticing that for all .
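The difference can be seen numerically. On binary indicator vectors, the Minkowski distance depends only on the size of the symmetric difference, whereas the set metric treats the two one-sided differences separately. The sketch below uses hypothetical sets over a five-element universe (not the example from the text) and assumes the set metric has the form $(|A \setminus B|^p + |B \setminus A|^p)^{1/p}$.

```python
def set_distance(a, b, p):
    """Set metric: (|A \\ B|^p + |B \\ A|^p)^(1/p)."""
    return (len(a - b) ** p + len(b - a) ** p) ** (1.0 / p)

def minkowski_on_indicators(a, b, universe, p):
    """Minkowski distance of order p between binary indicator vectors of a and b."""
    u = [1 if e in a else 0 for e in universe]
    v = [1 if e in b else 0 for e in universe]
    return sum(abs(x - y) ** p for x, y in zip(u, v)) ** (1.0 / p)

universe = range(1, 6)
A, B = {1, 2, 3, 4}, {3, 4, 5}
for p in (1, 2, 3):
    print(p, set_distance(A, B, p), minkowski_on_indicators(A, B, universe, p))
```

For these sets the two distances agree at $p = 1$, and the set metric is strictly larger for $p > 1$.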

3.2 Metrics on Ontological Annotations

We have previously introduced a concept of information content of a consistent subgraph in an ontology and a measure of functional similarity that can be used to evaluate protein function prediction [Clark2013, Jiang2014]. We briefly review these concepts and then proceed to introduce unnormalized and normalized versions of the semantic distance. We prove that both distances satisfy the properties of a metric.

Suppose that the underlying probabilistic model according to which protein functional annotations have been generated is a Bayesian network structured according to the underlying ontology $O$. That is, we consider that each concept in the ontology is a binary random variable and that the directed acyclic graph structure of the ontology specifies the conditional independence relationships in the network. Then, using the standard Bayesian network factorization we write the marginal probability for any consistent subgraph $T$ as

$$Pr(T) = \prod_{v \in T} Pr(v \mid \mathcal{P}(v)),$$

where $Pr(v \mid \mathcal{P}(v))$ is the probability that node $v$ is part of a functional annotation of a protein given that all of its parents $\mathcal{P}(v)$ are already part of the annotation. Due to the consistency requirements, the marginalization can be performed in a straightforward manner from the leaves of the network towards the root, excluding all nodes from $T$. This marginalization is reasonable because biological experiments result in incomplete annotations. Thus, treating nodes not in $T$ as unknown and marginalizing over them is intuitive. Observe that each conditional probability table in this (restricted) Bayesian network needs to store a single number; i.e., the concept $v$ can be present only if all of its parents are part of the annotation. If any of the parents is not a part of the annotation $T$, $v$ is guaranteed to not be in $T$.

We now express the information content of a consistent subgraph $T$ as

$$i(T) = \log \frac{1}{Pr(T)} = \sum_{v \in T} ia(v),$$

where $ia(v) = \log \frac{1}{Pr(v \mid \mathcal{P}(v))}$ is referred to as information accretion [Clark2013]. This term corresponds to the additional information inherent to the node $v$ under the assumption that all its parents are already present in the annotation of the protein.
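A small numerical sketch may help: the information content of a consistent subgraph is the sum of the information accretion of its nodes, which equals the negative logarithm of the product of the conditional probabilities. The toy terms, the probability values, and the use of base-2 logarithms are all assumptions made for illustration.

```python
import math

# Hypothetical conditional probabilities Pr(v | all parents of v are annotated)
# for a toy ontology with edges root -> a, root -> b, a -> c.
cond_prob = {"root": 1.0, "a": 0.5, "b": 0.25, "c": 0.5}

def information_accretion(v):
    """ia(v) = -log2 Pr(v | parents of v), measured in bits (base 2 is an assumption)."""
    return -math.log2(cond_prob[v])

def information_content(subgraph):
    """i(T): sum of the information accretion over the nodes of a consistent subgraph."""
    return sum(information_accretion(v) for v in subgraph)

T = {"root", "a", "c"}
joint = math.prod(cond_prob[v] for v in T)  # Pr(T) under the factorization
print(information_content(T), -math.log2(joint))
```

Both printed values are 2.0 bits, since $Pr(T) = 1 \cdot 0.5 \cdot 0.5 = 0.25$.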

We can now compare two protein annotations $F$ and $G$ [Clark2013]. For the moment, we will assume that annotation $G$ is a prediction of $F$. We define misinformation as the cumulative information content of the nodes in $G$ that are not part of the true annotation $F$; i.e., it gives the total information content along all incorrect paths in the ontology. Similarly, the remaining uncertainty gives the overall information content corresponding to the nodes in $F$ that are not included in the predicted graph $G$ (Figure 3). More formally, the misinformation and remaining uncertainty are defined as

$$mi(F, G) = \sum_{v \in G \setminus F} ia(v), \qquad ru(F, G) = \sum_{v \in F \setminus G} ia(v).$$

It is easy to see that $mi(F, G) = ru(G, F)$.

Figure 3: Illustration of the calculation of the remaining uncertainty and misinformation for two protein functions with their Biological Process Ontology annotations: $F$ (true, blue) and $G$ (predicted, red). The circled nodes contribute to the remaining uncertainty (blue nodes, left) and misinformation (red node, right).

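Let $X$ now be a non-empty set of all consistent subgraphs generated according to a probability distribution specified by the Bayesian network. We define a function $d^p: X \times X \to \mathbb{R}$ as

$$d^p(F, G) = \left(ru^p(F, G) + mi^p(F, G)\right)^{1/p} \tag{3}$$

where $p \geq 1$ is a parameter. We refer to the function $d^p$ as semantic distance. Similarly, we define another function $d_N^p: X \times X \to \mathbb{R}$ as

$$d_N^p(F, G) = \frac{\left(ru^p(F, G) + mi^p(F, G)\right)^{1/p}}{i(F \cup G)} \tag{4}$$

where, again, $p \geq 1$ is a parameter. We refer to the function $d_N^p$ as normalized semantic distance.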
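The two semantic distances can be sketched as follows, assuming that the remaining uncertainty and misinformation are sums of information accretion over the one-sided differences of the two annotations, and that the normalizer is the information content of their union. The information accretion values, the term names, and the zero-denominator convention are hypothetical.

```python
import math

# Hypothetical information accretion values (in bits) for a toy ontology.
ia = {"root": 0.0, "a": 1.0, "b": 2.0, "c": 1.0, "d": 3.0}

def semantic_distance(f, g, ia, p=1.0, normalized=False):
    """Semantic distance between consistent subgraphs f (true) and g (predicted)."""
    ru = sum(ia[v] for v in f - g)  # remaining uncertainty: in f but not in g
    mi = sum(ia[v] for v in g - f)  # misinformation: in g but not in f
    numerator = max(ru, mi) if math.isinf(p) else (ru ** p + mi ** p) ** (1.0 / p)
    if not normalized:
        return numerator
    total = sum(ia[v] for v in f | g)  # information content of the union
    return numerator / total if total > 0 else 0.0  # assumption: 0 when the union carries no information

F = {"root", "a", "c"}
G = {"root", "a", "b"}
print(semantic_distance(F, G, ia, p=1))
print(semantic_distance(F, G, ia, p=2))
print(semantic_distance(F, G, ia, p=1, normalized=True))
```

In this example $ru = 1$ and $mi = 2$, so the unnormalized distance at $p = 1$ is 3 and the normalized distance is $3/4$.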

Theorem 3.3.

$(X, d^p)$ and $(X, d_N^p)$ are metric spaces. In addition, $d_N^p: X \times X \to [0, 1]$.

Proof of Theorem 3.3.

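To show that $d^p$ is a metric is analogous to the proof of Theorem 3.1. Let $F$, $G$, and $H$ be arbitrary consistent subgraphs of the ontology. Instead of each set cardinality, we use the total information accretion of the corresponding set of nodes (e.g., $ru(F, G)$ in place of $|A \setminus B|$), and similarly for the other cardinalities. The proof follows line for line after these substitutions to the proof of Theorem 3.1.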

To prove in Theorem 3.3 is a metric, we follow a similar argument to the proof of Theorem 3.2. The analogues of here are

With these substitutions, the proof is exactly the same as that of Theorem 3.2.

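Invoking the Minkowski inequality, we obtain that

$$\left(ru^p(F, G) + mi^p(F, G)\right)^{1/p} \leq ru(F, G) + mi(F, G) \leq i(F \cup G).$$

Since $d_N^p$ is nonnegative, we obtain that $d_N^p(F, G) \in [0, 1]$.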

We note here that the concept of inverse document frequency [Tan2006], often used in text mining, is related to these distances. Suppose the ontology is a tree of depth one (the root node points to all nodes, each being a separate term) and . Then, a Jaccard distance on sparse encoding of inverse document frequency quantities directly reduces to semantic distance. We also note that our metrics provide a mechanism to apply similarity functions directly on text documents, without an intermediate step of feature engineering.

3.3 Metrics on Functions

In this section, we extend the previously introduced metrics to integrable functions and prove that the resulting metric spaces are complete.

3.3.1 Unnormalized Metrics on Functions

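Let $L$ be a set of bounded integrable functions on $\mathbb{R}$. We define $d^p: L \times L \to \mathbb{R}$ as

$$d^p(f, g) = \left(\left(\int \max(f(x) - g(x), 0)\,dx\right)^p + \left(\int \max(g(x) - f(x), 0)\,dx\right)^p\right)^{1/p} \tag{5}$$

where $f, g \in L$, and $p \geq 1$ is a parameter.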

Theorem 3.4.

$(L, d^p)$ is a metric space.

Proof of Theorem 3.4.

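Since $d^p$ is non-negative, and $d^p(f, g) = 0$ if and only if $f = g$ almost everywhere, it suffices to show that $d^p$ satisfies the triangle inequality. Let $f$, $g$, and $h$ be in $L$. Then we have

$$\begin{aligned} d^p(f, h) + d^p(h, g) &\geq \left(\left(\int \max(f - h, 0)\,dx + \int \max(h - g, 0)\,dx\right)^p + \left(\int \max(h - f, 0)\,dx + \int \max(g - h, 0)\,dx\right)^p\right)^{1/p} \\ &\geq \left(\left(\int \max(f - g, 0)\,dx\right)^p + \left(\int \max(g - f, 0)\,dx\right)^p\right)^{1/p} = d^p(f, g). \end{aligned}$$

The first inequality is the Minkowski inequality, and the second follows from the pointwise bound $\max(f - g, 0) \leq \max(f - h, 0) + \max(h - g, 0)$. Therefore, $d^p$ is a metric. ∎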

3.3.2 Normalized Metrics on Functions

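Let $L$ again be a set of bounded integrable functions on $\mathbb{R}$. We define $d_N^p: L \times L \to \mathbb{R}$ as

$$d_N^p(f, g) = \frac{\left(\left(\int \max(f(x) - g(x), 0)\,dx\right)^p + \left(\int \max(g(x) - f(x), 0)\,dx\right)^p\right)^{1/p}}{\int \max\left(|f(x)|, |g(x)|, |f(x) - g(x)|\right)dx} \tag{6}$$

where $p \geq 1$ is a parameter.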

Theorem 3.5.

$(L, d_N^p)$ is a metric space. In addition, $d_N^p: L \times L \to [0, 1]$.

Proof of Theorem 3.5.

It is easy to check that $d_N^p$ is non-negative, and $d_N^p(f, g) = 0$ if and only if $f = g$ almost everywhere. Therefore, it remains to be shown that the triangle inequality is satisfied.

Let , , and be bounded functions in . To begin, let us look at the trivial cases. Define and .

  1. If , then almost everywhere. Consequently and , so the inequality holds.

  2. If , then , in which case the inequality is true due to the non-negativity of .

  3. If , then almost everywhere and ; thus, the triangle inequality still holds.

Next we consider the case where none of the three denominators is zero.

where

Let . By subtracting from the numerator and denominator of at the same time, it follows that

The first of the above inequalities holds since we are subtracting a non-negative number no greater than the non-negative numerator from the top and bottom while the fraction stays in . The equality holds due to Lemma A.3 and the last inequality due to Lemma A.2. By analogy it can be shown that

Thus, we have .

For general functions , we can exclude the set , since the set where the functions are infinite is of measure zero. Thus, the theorem would proceed the same way since Lemma A.1 and Lemma A.2 still hold for .

Now we show that . Since is nonnegative, we only need to show that it is bounded by 1. Applying the Minkowski inequality we have that

Thus the numerator is bounded by the denominator and the fraction is no greater than 1. With for , we have shown that this metric is in . ∎

Observe that the Marczewski-Steinhaus distance [marczewski1958certain] is a special case of $d_N^p$ when $p = 1$.
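A numerical sketch of the metrics on functions is given below. It assumes that the numerator combines the integrals of $\max(f - g, 0)$ and $\max(g - f, 0)$ and that the normalizer is the integral of $\max(|f|, |g|, |f - g|)$, which for non-negative functions reduces to the integral of $\max(f, g)$ used by the Marczewski-Steinhaus distance. The midpoint-rule integrator, the restriction to a finite interval, and all names are assumptions of this sketch.

```python
import math

def integrate(h, lo, hi, n=10000):
    """Midpoint-rule approximation of the integral of h over [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(h(lo + (k + 0.5) * step) for k in range(n))

def function_distance(f, g, lo, hi, p=1.0, normalized=False):
    """Distance between functions f and g, both treated as zero outside [lo, hi]."""
    pos = integrate(lambda x: max(f(x) - g(x), 0.0), lo, hi)
    neg = integrate(lambda x: max(g(x) - f(x), 0.0), lo, hi)
    numerator = max(pos, neg) if math.isinf(p) else (pos ** p + neg ** p) ** (1.0 / p)
    if not normalized:
        return numerator
    denominator = integrate(
        lambda x: max(abs(f(x)), abs(g(x)), abs(f(x) - g(x))), lo, hi)
    return numerator / denominator if denominator > 0 else 0.0

f = lambda x: x
g = lambda x: 1.0 - x
print(function_distance(f, g, 0.0, 1.0, p=1))
print(function_distance(f, g, 0.0, 1.0, p=1, normalized=True))

# Marczewski-Steinhaus distance for the same pair of non-negative functions.
ms = (integrate(lambda x: abs(f(x) - g(x)), 0.0, 1.0)
      / integrate(lambda x: max(f(x), g(x)), 0.0, 1.0))
print(ms)
```

For this pair, each one-sided integral equals $1/4$ and the normalizer equals $3/4$, so the normalized distance at $p = 1$ and the Marczewski-Steinhaus distance both come out close to $2/3$.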

Theorem 3.6.

$(L, d^p)$ and $(L, d_N^p)$ are complete spaces for $p \geq 1$.

Proof of Theorem 3.6.

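By definition, a metric space $(X, d)$ is complete if all Cauchy sequences in $X$ converge in $X$; that is, if the limit point of every Cauchy sequence in $X$ remains in $X$. Let us first consider a Cauchy sequence $\{f_n\}$ in $(L, d^p)$, where for a given $\epsilon > 0$, there exists some $N$ such that $d^p(f_n, f_m) < \epsilon$ for all $n, m \geq N$; i.e., $\left(\left(\int \max(f_n - f_m, 0)\,dx\right)^p + \left(\int \max(f_m - f_n, 0)\,dx\right)^p\right)^{1/p} < \epsilon$. It follows that $\int \max(f_n - f_m, 0)\,dx < \epsilon$ and $\int \max(f_m - f_n, 0)\,dx < \epsilon$ and thus $\int |f_n - f_m|\,dx < 2\epsilon$. Therefore, $\{f_n\}$ is a Cauchy sequence in $L^1$ space, where the metric in $L^1$ is $\int |f - g|\,dx$ for integrable functions $f$ and $g$, and thus converges to a function in $L^1$ by the completeness of $L^1$ space.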

Now we look at a Cauchy sequence in , then by Lemma A.4 we have that for all for some positive constant . It follows that for any given , there exists some integer such that for all , or in other words, . Therefore, is Cauchy in and by previous results we know that has a limit in and therefore is complete. ∎

4 Empirical Investigation

4.1 Phylogenetic Reconstruction Using Protein Functions

To demonstrate the effectiveness of distance measures on protein function annotations we clustered well-annotated species according to their functional annotations and evaluated the similarity between such clusters and known species trees for these organisms. For simplicity, we will refer to the tree derived solely from functional information as a functional phylogeny.

4.1.1 Data Sets

Protein function data were downloaded from the Swiss-Prot database (July 2015) [Bairoch2005]. Because the annotation experiments are generally focused on model organisms, only a limited number of species contained a sufficient number of functional annotations in all three ontologies for the data analysis. In particular, we collected protein function data for the following species: Homo sapiens, Mus musculus, Arabidopsis thaliana, Saccharomyces cerevisiae, and Escherichia coli. Only those annotations with (experimental) evidence codes EXP, IDA, IMP, IPI, IGI, IEP, TAS, and IC were considered. Table 1 summarizes the data sets: here the genome size corresponds to the total number of proteins available for each species in Swiss-Prot. The MFO, BPO, and CCO columns show the numbers of proteins from each species for which experimental annotations were available in the Molecular Function, Biological Process, and Cellular Component ontologies, respectively.

Organism         Genome size   MFO      BPO      CCO
H. sapiens       20,193        11,979   11,398   12,691
M. musculus      16,733        6,728    7,702    7,322
A. thaliana      14,305        4,266    5,749    5,950
S. cerevisiae    6,720         4,051    4,676    4,102
E. coli          4,433         2,272    2,331    2,119

Table 1: Data set sizes for the five organisms used in this work. The genome size refers to the number of protein sequences available in Swiss-Prot for each species.

The conditional probability tables were estimated using the maximum likelihood approach from the entire set of functionally annotated proteins in Swiss-Prot. This set included proteins from species with available MFO annotations, proteins from species with available BPO annotations, as well as proteins from species with available CCO annotations.
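One plausible reading of the maximum likelihood step is a ratio of counts: among the annotated proteins that contain all parents of a term, the fraction that also contain the term itself. The sketch below implements that reading on a hypothetical toy data set; the exact estimator, the handling of terms whose parents are never jointly observed, and all names are assumptions.

```python
def estimate_conditional_probs(annotations, parents):
    """Estimate Pr(v | all parents of v are annotated) as a ratio of counts."""
    probs = {}
    for v, pars in parents.items():
        # Annotations in which every parent of v is present (all of them for a root).
        eligible = [t for t in annotations if all(p in t for p in pars)]
        if eligible:  # terms whose parents never co-occur are left unestimated
            probs[v] = sum(v in t for t in eligible) / len(eligible)
    return probs

# Hypothetical toy ontology and annotation database.
parents = {"root": [], "a": ["root"], "b": ["root"], "c": ["a", "b"]}
annotations = [
    {"root", "a"},
    {"root", "a", "b"},
    {"root", "a", "b", "c"},
    {"root"},
]
print(estimate_conditional_probs(annotations, parents))
```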

4.1.2 Clustering

The functional phylogenetic trees with respect to a group of organisms were generated using single-linkage hierarchical clustering. This algorithm starts by considering every data point (species) to be a cluster of unit cardinality and in each step merges the two closest clusters. The algorithm continues until all original data points belong to the same cluster. The distance between species was based on pairwise distances between functionally annotated proteins as described below. For simplicity, we used normalized semantic distance from Equation (4) with in all experiments.
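The merging procedure described above can be sketched with a naive implementation of single-linkage clustering. The labels and pairwise distances below are made up for illustration and are not the distances obtained in this work; the function name and the data layout are assumptions.

```python
def single_linkage(labels, dist):
    """Naive single-linkage clustering; returns the merges as (left, right, distance)."""
    clusters = [frozenset([x]) for x in labels]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist[frozenset((a, b))]
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((set(clusters[i]), set(clusters[j]), d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Made-up symmetric distances between four hypothetical items.
dist = {
    frozenset(("w", "x")): 0.2, frozenset(("w", "y")): 0.6, frozenset(("w", "z")): 0.9,
    frozenset(("x", "y")): 0.5, frozenset(("x", "z")): 0.8, frozenset(("y", "z")): 0.7,
}
for left, right, d in single_linkage(["w", "x", "y", "z"], dist):
    print(sorted(left), sorted(right), d)
```

The sequence of merges defines an unrooted tree: here the closest pair is merged first, and each remaining item joins at the smallest distance to any member of the growing cluster.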

Without loss of generality, we illustrate the species distance calculation by showing how to compute the distances between A. thaliana (A) and all other organisms. An important challenge in this task arises from unequal genome sizes as well as unequal fractions of experimentally annotated proteins in each species (Table 1), making most distance calculation techniques unsuitable for this task. We therefore used sampling to compare species using a fixed yet sufficiently large set of proteins from each species. The algorithm first samples (with replacement) $N$ proteins from each species. It then counts the number of times the proteins from E. coli (E), H. sapiens (H), M. musculus (M) and S. cerevisiae (Y) are functionally most similar to proteins in A. thaliana, with ties resolved uniformly at random. These counts were used to calculate the directional distances between A. thaliana and the remaining four species. The procedure is repeated $B$ times with different bootstrap samples to stabilize the results. The details of the algorithm are shown in Algorithm 1.

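Input : Sets of protein functions $\{F_{X,i}\}_{i=1}^{n_X}$, where $n_X$ is the number of functionally annotated proteins in organism $X$, for $X \in \{A, E, H, M, Y\}$, and a metric $d$ on ontologies.
Output : Directional distances from A to each of E, H, M, and Y.
begin
       Initialize the bootstrapping sample size $N$ and the iteration count $B$.
       for each bootstrap iteration $b = 1, \ldots, B$ do
             Bootstrap $N$ proteins from each organism $X \in \{A, E, H, M, Y\}$.
             for each sampled protein of A do
                   for each organism $X \in \{E, H, M, Y\}$ do
                         Compute the smallest distance $d$ between the sampled protein of A and the sampled proteins of $X$.
                   end for
                   Record the organism attaining the smallest such distance, resolving ties uniformly at random.
             end for
             for each organism $X \in \{E, H, M, Y\}$ do
                   Count the sampled proteins of A whose closest protein belongs to $X$.
             end for
       end for
       for each organism $X \in \{E, H, M, Y\}$ do
             Compute the directional distance from A to $X$ from the counts accumulated over the $B$ iterations.
       end for
end
Algorithm 1 Computing distances from A. thaliana (A) to E. coli (E), H. sapiens (H), M. musculus (M), and S. cerevisiae (Y) respectively.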

After obtaining the directional distances from A to E, H, M, and Y, we repeated the above algorithm to determine directional distances starting from E, H, M, and Y. The final distance between any two organisms X and Z was determined as the average of the two directional distances, from X to Z and from Z to X.
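The sampling procedure can be sketched as follows. The bootstrap, nearest-species counting, and random tie-breaking follow the description in the text; the conversion from counts to a directional distance (one minus the fraction of wins) is an assumption of this sketch, as are the function names, the use of the Jaccard distance as the protein-level metric, and the toy data.

```python
import random

def jaccard(a, b):
    return 1.0 - len(a & b) / len(a | b)

def directional_distances(source, others, d, n_samples, n_iters, seed=0):
    """Directional distances from one organism to each organism in `others`."""
    rng = random.Random(seed)
    wins = {name: 0 for name in others}
    for _ in range(n_iters):
        # Bootstrap the same number of proteins from every organism.
        src = rng.choices(source, k=n_samples)
        samples = {name: rng.choices(prots, k=n_samples)
                   for name, prots in others.items()}
        for f in src:
            nearest = {name: min(d(f, g) for g in sample)
                       for name, sample in samples.items()}
            best = min(nearest.values())
            # Ties between organisms are resolved uniformly at random.
            winner = rng.choice([name for name, v in nearest.items() if v == best])
            wins[winner] += 1
    total = n_samples * n_iters
    # Assumption: distance = 1 - fraction of source proteins closest to that organism.
    return {name: 1.0 - wins[name] / total for name in others}

# Toy data: "near" shares annotations with the source organism, "far" does not.
source = [{1, 2}, {1, 2, 3}]
others = {"near": [{1, 2}, {1, 2, 3}], "far": [{8, 9}, {9}]}
print(directional_distances(source, others, jaccard, n_samples=20, n_iters=5))
```

A symmetric distance between two organisms would then be obtained by running the procedure in both directions and averaging the two directional values.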

It is worth further emphasizing that the sampling procedure in Algorithm 1 was chosen to reduce the influence of incomplete organismal annotations and large differences in the sizes of the available sets of functions among organisms. For example, 11,979 human proteins but only 6,728 mouse proteins have MFO annotations (Table 1), which might cause a human protein to always be closer to any E. coli protein than any mouse protein, only because of a larger probability of a random close match, when in reality these two organisms are similarly distant from E. coli. The incomplete and biased annotations present a problem for creating functional phylogenies that is markedly different from clustering sets of sequence data, where the entire sequence complement for an organism is usually known and only sequences present in all species are used.

4.1.3 Functional Phylogenies

The entirety of the genetic information present within the model organisms considered here has been obtained, and the analysis of this genetic data has resulted in a well accepted set of phylogenetic relationships among these species. Using the distance measure introduced here, we generated functional phylogenies describing the relationships among these species using only protein function information. If our distance measure works well for the Gene Ontology data, we expect to recover the correct phylogenetic relationships.

Using the MFO and CCO functional annotations our clustering approach did recover the correct relationships among species (Figure 4). This result is gratifying, especially as we might expect many similar functions to be present in the single-celled organisms (E. coli and S. cerevisiae). However, using the BPO annotations did not result in the correct phylogeny, as the positions of S. cerevisiae and A. thaliana were reversed (Figure 5). The accuracy of the MFO and CCO annotations and the inaccuracy of the BPO annotations are consistent with the higher level of functional conservation for the MFO annotations [Nehrt2011], as greater conservation of function could result in more phylogenetic signal within this ontology. As a reminder, this algorithm only produces an unrooted topology among the species. It is up to the experimenters to root the tree with some expert knowledge, as we have done here.

Figure 4: Functional phylogenetic tree for H. sapiens, M. musculus, S. cerevisiae, A. thaliana and E. coli in the Molecular Function and Cellular Component ontologies.
Figure 5: Functional phylogenetic tree for H. sapiens, M. musculus, S. cerevisiae, A. thaliana and E. coli in the Biological Process Ontology.

It is important to mention several practical issues related to the steps used to generate the functional phylogeny. First, as the functional data are biased and incomplete, it was relatively surprising that the available annotations could (approximately) recover the correct phylogenetic tree. We believe that as the biological data improve in quality the functional phylogeny should become nearly identical to the genetic sequence phylogeny, or perhaps may even be able to offer additional evolutionary insights [Zhu2015]. This particularly holds for the BPO annotations that are of relatively lower quality and are also less predictable from sequence and molecular data [Radivojac2013]. Second, the choice of a clustering algorithm may affect the outcome of the phylogenetic reconstruction. Nonetheless, we experimented with clustering using single linkage, complete linkage, and group-average strategies for computing distances between clusters. We noticed little change in the resulting phylogenies for any of the ontologies. There was also no dependence on the selection of the bootstrap sample size, for which we evaluated three values (note that the sample size was required to be smaller than the smallest term in Table 1).

5 Discussion

In this work we introduced new metrics on sets, ontologies, and functions that can be used in several stages of data processing pipelines, including supervised and unsupervised learning, exploratory data analysis, visualization, and result interpretation. We showed that these metrics are applicable on the space of protein functional annotations and that they have natural information-theoretic interpretation. Our experiments have revealed that our metrics can be used to correctly recover the phylogenetic relationships among species using only protein functions, and that they therefore represent promising measures for future studies of the evolution of function.

In several recent publications, including our own, non-metric similarity functions have been used to reason about functional similarity between biological macromolecules as well as prediction accuracy of automated annotation tools. Assuming non-empty sets $A$ and $B$, these include the Maryland bridge distance

$$d_{MB}(A, B) = 1 - \left(\frac{|A \cap B|}{2|A|} + \frac{|A \cap B|}{2|B|}\right),$$

where the term in the parentheses corresponds to the Maryland bridge coefficient [Glazko2005]. When $A$ is interpreted as a predicted set of terms for a protein and $B$ as a true set of terms, the Maryland bridge coefficient is simply an average of precision ($|A \cap B| / |A|$) and recall ($|A \cap B| / |B|$). Similarly, an often-used harmonic mean between precision and recall, referred to as F-measure [Tan2006], leads only to a near-metric called the Czekanowski-Dice distance [book]. This distance function can be expressed as

$$d_{CD}(A, B) = 1 - \frac{2|A \cap B|}{|A| + |B|},$$

where, again, $A$ and $B$ are sets and the second term on the right-hand side is the F-measure. While on the surface the process of finding an average pairwise distance for a set of graphs appears to be straightforward, it is unclear what the effect of such operations might be when the triangle inequality is violated. We therefore advise caution in interpreting results when non-metric distance functions are used.
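A three-set example makes the violation concrete. The sketch below assumes the standard forms of the two distances (one minus the average of precision and recall, and one minus the F-measure) and compares them with the Jaccard distance; the helper `triangle_gap` and the example sets are illustrative.

```python
def maryland_bridge(a, b):
    inter = len(a & b)
    return 1.0 - 0.5 * (inter / len(a) + inter / len(b))

def czekanowski_dice(a, b):
    return 1.0 - 2.0 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):
    return 1.0 - len(a & b) / len(a | b)

def triangle_gap(d, a, b, c):
    """d(a, c) + d(c, b) - d(a, b); a negative value violates the triangle inequality."""
    return d(a, c) + d(c, b) - d(a, b)

A, B, C = {"x"}, {"y"}, {"x", "y"}
print(triangle_gap(maryland_bridge, A, B, C))   # -0.5: violated
print(triangle_gap(czekanowski_dice, A, B, C))  # negative: violated
print(triangle_gap(jaccard, A, B, C))           # 0.0: holds with equality
```

Here the two disjoint singletons are at distance 1 under all three functions, but the detour through their union costs only 0.5 under the Maryland bridge distance and about 0.67 under the Czekanowski-Dice distance.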

Although our experimental investigation only considered biomedical ontologies, the metrics proposed in Equations (1)-(6) have broader applicability. For example, metrics on ontologies can be directly used in the areas of text mining or computer vision in the tasks of joint topic classification [Grosshans2014] or fine-grained classification [Movshovitz2015]. Alternatively, the continuous versions might be readily applicable in comparisons of probability distributions as an alternative to Kullback-Leibler divergence [Kullback1951] that lacks theoretical properties such as symmetry and triangle inequality (the symmetry requirement can be remedied by using the J-divergence [Lin1991]).

Finally, we emphasize that it is difficult to empirically demonstrate that a particular metric is useful in any setting since a common practice in modeling is to initially select an objective function based on domain knowledge and then to develop an algorithm to directly minimize it. Therefore, we believe that the class of functions proposed in this work presents sensible choices in various fields and hope that their good theoretical properties will play a positive role in their adoption.

Appendix A

Lemma A.1.

For any real functions , we have

Proof.

For , it is not hard to see that . When , without loss of generality, suppose that and , then we have

Lemma A.2.

For any real functions and , we have

Proof.

By Lemma A.1, it is equivalent to show

Since , it suffices to show that it is no less than .

Consider the case first. We have or . is located between and , therefore and . Then we have

and

This shows that

When , we have the following two cases

For case 1, we want to show that . This is true since

and

Combining those two inequalities we get .

For case 2, we want to show . By the same analogy, since

and

we have .

Therefore this inequality still holds for . ∎

Lemma A.3.

For any two real functions and , we have

Proof.

First consider the case when . In this case, we have and , this equality holds. Otherwise when , we have and . ∎

Lemma A.4.

Any Cauchy sequence in is bounded in ; i.e., for some constant .

Proof.

We instead prove the contra-positive version of the above statement. Suppose is a sequence in that is unbounded in , or equivalently as . Thus, we have

For any integer , pick to be and we have

where