Enumerating consistent subgraphs of directed acyclic graphs: an insight into biomedical ontologies

Enumerating consistent subgraphs of directed acyclic graphs: an insight into biomedical ontologies

Yisu Peng111Contributed equally to this work. Department of Computer Science,
Indiana University, Bloomington, Indiana, U.S.A.
Yuxiang Jiang11footnotemark: 1 Department of Computer Science,
Indiana University, Bloomington, Indiana, U.S.A.
Predrag Radivojac222Corresponding author, email: predrag@indiana.edu Department of Computer Science,
Indiana University, Bloomington, Indiana, U.S.A.
Abstract

Modern problems of concept annotation associate an object of interest (gene, individual, text document) with a set of interrelated textual descriptors (functions, diseases, topics), often organized in concept hierarchies or ontologies. Most ontologies can be seen as directed acyclic graphs, where nodes represent concepts and edges represent relational ties between these concepts. Given an ontology graph, each object can only be annotated by a consistent subgraph; that is, a subgraph such that if an object is annotated by a particular concept, it must also be annotated by all other concepts that generalize it. Ontologies therefore provide a compact representation of a large space of possible consistent subgraphs; however, until now we have not been aware of a practical algorithm that can enumerate such annotation spaces for a given ontology. In this work we propose an algorithm for enumerating consistent subgraphs of directed acyclic graphs. The algorithm recursively partitions the graph into strictly smaller graphs until the resulting graph becomes a rooted tree (forest), for which a linear-time solution is computed. It then combines the tallies from graphs created in the recursion to obtain the final count. We prove the correctness of this algorithm and then apply it to characterize four major biomedical ontologies. We believe this work provides valuable insights into concept annotation spaces and predictability of ontological annotation.

1 Introduction

Ontologies have become a common means of concept annotation in computational biology and related fields [18]. A protein’s molecular function [1], an effect of a genetic variant [28], or a patient’s diagnosis [19] are typical examples in which biomedical entities such as macromolecules, mutations, or individuals are associated with sets of mutually dependent descriptors. The dependencies between these descriptors are often hierarchical, leading to the use of directed acyclic graphs as concept space representations.

A directed acyclic graph is a pair , where is a set of vertices (nodes) and is a set of directed edges (links) between vertices such that no cycles can be formed. Each vertex in the graph is associated with a unique concept (term, description) and each edge is associated with a particular type of relational tie. For example, when annotating proteins as biomedical entities using the Gene Ontology graph [1], the terms “nucleic acid binding” and “DNA binding” are linked by edges of the type - asserting that DNA binding is a more specific form of nucleic acid binding. Other types of relational ties include -, , and so on.

A typical biomedical entity is associated with a set of terms determined through experiment such as a molecular assay or a diagnostic procedure. A protein, for example, may be assigned terms “DNA binding” and “RNA binding”, neither of which is a generalization of the other. To avoid annotation inconsistencies, this protein must also be annotated by the terms such as “nucleic acid binding” and all others that generalize either of the experimentally determined terms. More broadly, this implies that a biomedical object can only be annotated by a set of terms that respect the hierarchy – a consistent subgraph of the ontology. Unfortunately, (manual) experimental annotation is resource-demanding and often incomplete [16], giving rise to an entire field of computational prediction [17, 11].

The development of computational prediction methods presents its own challenges. Although it can be performed by building a separate binary classifier for each concept in the ontology, this approach is currently competitive only for specialized ranking tasks; e.g., disease-gene prioritization [14], since it does not exploit relationships between the terms. On the other hand, a more complete characterization is via learning structured outputs [25] in which a method takes an object (e.g., a protein) and is asked to provide the totality of concepts with which this object might be associated (i.e., a consistent subgraph). However, the structured-output formulation generally falls under the extreme classification umbrella because the size of the output space is often exceedingly large. This poses problems in measuring similarity between annotations, evaluating accuracy of classification models, and optimization when solving the “argmax problem” [4, 5, 12].

We identify now what we believe is an open problem in computational biology and computer science; that is, efficiently determining the exact number of consistent subgraphs in a given ontology. This problem has a linear-time solution for rooted trees [22], but to our knowledge no such algorithm exists for directed acyclic graphs. This paper therefore proposes a practical solution to this enumeration problem, proves its correctness, analyzes run-time complexity, and introduces various computational speedups. Using this new approach, we analyze four often-used ontologies from the biomedical domain and explore the space of possible annotations. We believe that the algorithms, software, and analysis carried out in this work will lead to better insights into concept annotation spaces and facilitate ontology quality assurance.

2 A Motivating Example

A growing number of concept annotation problems are formulated as the manual or computational assignment of a set of mutually related textual descriptors to some objects of interest. One of such problems is the computational prediction of protein function [5], which can be broadly operationalized as follows:

Given: (1) an amino acid sequence with auxiliary data such as structure, expression, interactions, etc. of a protein with unknown or incomplete function; (2) training data that includes sequences, structures, or systems data corresponding to a (large) set of proteins, some of which have their true biological functions available; (3) a Gene Ontology (GO); i.e., a concept hierarchy used to represent biological functions of proteins in a structured and easy-to-compute-on form.

Objective: provide a set of GO terms that are most likely to be the true (experimental) annotation of .

The objects of interest here are proteins and the set of textual descriptors of protein function is given by GO – an ontology with a directed acyclic graph structure where each node represents a textual descriptor and each edge represents a particular type of a relational tie between two descriptors [1].

Figure 1: The functional annotation of the friend leukemia integration 1 transcription factor isoform 1 [FLI1; Homo sapiens] (RefSeq ID: NP_002008.2) as a consistent subgraph of the molecular function ontology. The arrows in this graph indicate an is-a relationship and are drawn in the reverse direction.

An example of such an annotation is shown in Figure 1, where 8 terms from the molecular function domain have been assigned to this protein. Due to the hierarchical organization of GO, both the set of experimentally determined terms and the set of computationally predicted terms must respect this hierarchy. As shown in this example, the annotation of the term “DNA binding”, implies the annotation of all the other GO terms that conceptually generalize it; e.g., “nucleic acid binding”, “binding”, etc. Typically, the ontology used to represent the annotation space of proteins contains thousands to tens of thousands of terms, whereas the true annotation of a protein consists of tens to at most hundreds of terms. Because the task of a prediction algorithm is to find the most likely annotation, it must devise an efficient procedure to search through the space of all possible annotations.

Most biomedical ontologies have grown over the years to contain a large number of terms. Computationally selecting one such “winning” annotation; i.e., a set of terms, or even providing a short list of most likely annotations, is a significant challenge [12, 25]. This prediction problem thus belongs to a so-called extreme classification scenario because the number of possible (discrete) annotations the algorithm must consider is astronomically large. In fact, we noticed that it is not even possible to give an exact number of possible annotations for a protein. Therefore, an answer to such a simple question (“What is the the number of possible GO annotations a protein can be assigned?”) requires the development of a practical counting algorithm. The resulting counts can, in turn, give insight into the nature and the difficulty of the computational function annotation of biological macromolecules.333Reasonable approximations can be provided by calculating the lower and upper bounds, as we have done later in Section 6. Neither of those, however, provides a full intellectual satisfaction when an exact count can actually be computed.

It is important to mention that the annotation of biological macromolecules is one of the most interesting examples of concept annotation, primarily because of its biomedical significance but also because of the sizes of the available ontologies. Similar situations, however, arise beyond computational biology, as in the fields of text mining [8] and computer vision [15].

3 Preliminaries

3.1 Basic Concepts and Notation

Let be a directed graph, where is a set of vertices representing concepts and is a collection of ordered pairs representing directional relationships, , between two concepts. A sequence of vertices is called a walk if for . A walk of distinct vertices except for the identical starting and ending vertices is called a cycle. A directed graph that does not contain cycles is referred to as directed acyclic graph (DAG).

Given two vertices in a DAG, is said to be an ancestor of and is said to be a descendant of if there exists a walk from to . We denote a set of all ancestors of as and a set of all descendants of as . We next define as the set of extended ancestors of and as the set of extended descendants of . Finally, if , the vertex is said to be a parent of , whereas is said to be a child of . We denote the set of all parents of as and the set of all children of as .

3.2 Transitivity of Relational Ties

When an object is annotated with ontological concepts, it is often considered that all ancestors of those annotated concepts should be automatically assigned to the object. For example, annotating the function of a protein with “enzyme binding” also implicitly annotates it with “protein binding”, “binding” and, finally, the root term “molecular function”. This type of reasoning requires all involved relationships between concepts to be transitive.

Biomedical ontologies, however, usually contain various types of relationships between concepts, some of which are not transitive. Therefore, we only consider - and - relationships, both of which maintain transitivity and permit reasoning about ancestral concepts. It is also worth noting that we define the direction of edges to be pointing from the general terms to specific so that the depth of a node aligns with the increasing resolution of the descriptors. We show in Section 5.2 that the directionality of edges has no impact on the total count. Throughout this work, we consider an ontology to be a DAG, where edges represent transitive relationships.

3.3 Consistent Subgraphs

Let be an ontology and a set of vertices. A subgraph is said to be induced from the original graph by if is the largest subset of pairs from such that both . We denote such vertex-induced subgraph as . We also use to denote the subgraph induced by vertices other than .

Definition 3.1.

A subgraph with respect to the original graph is called consistent if , .

4 Basic Algorithms

4.1 Problem Specification

Given an ontology , our goal is to develop a practical algorithm that enumerates all consistent subgraphs of . We allow the graph to have more than a single root (a vertex with no incoming edges) as well as to be disconnected.

An example of the enumeration problem is shown in Figure 2. We generally observe that the number of consistent subgraphs is bounded from below by , where is the total number of leaf vertices (those with no outgoing edges), and from above by . The structure of the graph, however, determines the exact count and its proximity to either of the bounds. If the input graph is a chain of vertices (), the total number of consistent subgraphs equals . On the other hand, if the original graph is a set of disconnected vertices (), there are consistent subgraphs. This analysis suggests that enumerating consistent subgraphs has a straightforward intractable solution of listing all vertex-induced subgraphs of the ontology and checking for the consistency of each such subgraph.

Figure 2: Consistent subgraphs of an ontology with vertices and edges. Observe that the reversal of all edges in the graph would lead to a reversed graph with the same number of consistent subgraphs (white vertices; Theorem 5.1).

We use to denote the desired function that takes a directed acyclic graph as input and returns the number of consistent subgraphs in that graph. We use and for the special cases where the input graph is a rooted tree or a forest , respectively.

4.2 Counting Consistent Subgraphs in Trees

We first discuss a special case where the input graph is a rooted tree; that is, when each non-root vertex has a single parent. In this case, there exists a linear algorithm in the number of vertices; see Lemma 1 in [22]. We provide this solution in Algorithm 1 with a minor modification resulting from the fact that our algorithm includes an empty tree in the total count. This algorithm naturally extends to collections of rooted trees. One can enumerate subtrees for each tree and take the product as the total count. We refer to this extended algorithm as  (not shown).

Input : A tree , rooted at .
Output : The number of consistent subgraphs in .
1 Function 
2        if  is empty then
3               return 1;
4              
5        else
6               return ;
7              
8        end if
9       
10 end
Algorithm 1 Counting the number of consistent subgraphs in rooted trees [22].

Algorithm 1 recursively traverses a tree in a pre-order manner. For any subtree rooted at vertex , the number of consistent subtrees that contain equals the product of all subcounts from its subtrees rooted at each child. Additionally, we add for the only consistent subtree that does not contain ; i.e., the empty tree. The recursion terminates at the empty tree whose count is one.

Input : A directed acyclic graph .
Output : The number of consistent subgraphs in .
1 Function 
2        if  is a forest then
3               return ;
4              
5        else
6               Pick any vertex as the pivot;
7               return ;
8              
9        end if
10       
11 end
Algorithm 2 Counting the number of consistent subgraphs in directed acyclic graphs.

4.3 Counting Consistent Subgraphs in Directed Acyclic Graphs

Directed acyclic graphs generalize trees in that they allow for multi-parent vertices. Such vertices, however, break Algorithm 1 because the recursive branches are no longer independent. Algorithm 2 circumvents this problem by recursively decomposing a graph into two strictly smaller subgraphs according to a selected pivot vertex. We will show in the next section that the number of consistent subgraphs in the two smaller graphs add up to be the number for the original graph (Line 2, Algorithm 2). The algorithm continues recursive enumeration until the graph becomes a forest, in which case it calls . Figure 3 illustrates the process of graph decomposition with respect to the pivot vertex . We note that any vertex can serve as pivot and will discuss the selection of pivots and how they impact the run time in Sections 5.3 and 6.1.

Figure 3: Illustration of graph decomposition. The enumeration problem of the original graph from panel (a) is split into two subproblems based on the pivot vertex ; shown in panels (b) and (c). The count in (b) corresponds to the number of consistent subgraphs in (a) that do not include , while the count in (c) corresponds to the count of consistent subgraphs in (a) that include . In panel (a), the set of descendants of is shaded in orange and the set of ancestors is shaded in blue.

4.4 Correctness and Complexity of the Algorithm

We first observe that the size of the problem in the number of vertices is guaranteed to decrease during recursive calls, thus ensuring that the algorithm terminates after a finite number of iterations. Next, we justify the equation corresponding to the Line 2 in Algorithm 2,

(1)
Lemma 4.1.

Let be the number of consistent subgraphs in that do not contain . We have .

Proof.

The equal cardinality of the two sets of consistent subgraphs is demonstrated by showing that both sets are contained in each other. For any that induces a consistent subgraph of , it also induces a unique consistent subgraph in . Also, since none of them contains , we have . Conversely, for any consistent subgraph induced by such that , we have by the definition of consistency. Therefore, also induces a consistent subgraph in . That is, .

Lemma 4.2.

Let be the number of consistent subgraphs in that contain . We have .

Proof.

As in Lemma 4.1, for any that induces a consistent subgraph of , also induces a unique consistent subgraph (that contains ) in the original graph. That is, . Also, for any consistent subgraph induced by and , we have by the definition of consistency. Note that the uniqueness of implies the uniqueness of . We can see that the subgraph induced by in is consistent. Given , and being an edge in , we must have as well, due to the consistency of with respect to the original graph. That is, .

Theorem 4.1.

Given an ontology and any , the number of consistent subgraphs in equals the sum of the numbers of consistent subgraphs in and .

Proof.

Equation 1 holds by combining Lemmas 4.1 and 4.2. ∎

To analyze complexity of the algorithm, let be the number of vertices in the graph and be the number of multi-parent vertices. Assuming a multi-parent vertex is always selected as pivot, we can express the run time complexity via the following recurrence

where incorporates the time to select the pivot, split the graph and add two large integers. Let us further assume that the larger of the two graphs after decomposition contains elements, where . It is now straightforward to show that , where if and if .

We can now see that the algorithm is exponential in the worst case; however, it reduces to a polynomial algorithm when or when . Assuming linear time to conduct graph decomposition and a constant time for addition/multiplication, we obtain .

5 Advanced Algorithms

The run-time of the algorithm heavily depends on the structure of the ontology and the selection of pivots. Here we discuss several practical considerations aimed at accelerating Algorithm 2. Once we conclude this discussion, the full method will be presented in Algorithm 3 (Section 6).

5.1 Pruning Branching Components

It is easy to observe that when the ontology consists of multiple connected components, these components can be independently and, if needed, simultaneously processed. We take this reasoning a step further to consider a special scenario of nearly disconnected graphs where (i) the two components are connected via a single vertex and (ii) all vertices in one component are descendants of this vertex.

Definition 5.1.

Given a graph and , is called a branching component if   and . Vertex is called a branching vertex.

Figure 4: Illustration of branching components. Panel (a) shows a branching vertex that separates the graph into a stem component and a branching component . The collection of edges from to is replaced by a zigzag arrow. Panel (b) shows a component-wise tree structure.

Figure 4a gives an example in which is a branching vertex, since the removal of disconnects (i.e., the branching component, ) from the rest of the graph. We refer to the remaining part of the graph as the stem component, . More generally, Figure 4b shows a graph with a component-wise tree structure, where branching vertices serve as hinges of branching component to their corresponding stems. We will use to denote the desired structure.

Given , we demonstrate that can be decoupled into two sequential subproblems: (i) and (ii) . We use for the subtotal of consistent subgraphs in the branching component . We also notice that the entire branching component can be pruned once is computed, making a leaf vertex in . Therefore, we modify the algorithm so as to allow a subtotal count for every vertex as if a branching component has been pruned from . Notice that for all intermediate vertices and original leaves.

With the introduction of , the recursive equation in Algorithm 1 becomes

(2)

Similarly, Equation 1; i.e., Line 2 in Algorithm 2, must be modified to

(3)

where accounts for the fact that for any consistent subgraph in the pruned and any consistent subgraph in , is a distinct consistent subgraph in . The approach naturally extends to multiple (hierachical) branching components such that we compute the subtotal of consistent subgraphs within each component and agglomerate them in a reversed topological order.

The pruning operation is preferred before each instance of decomposition for two main reasons: (i) it divides the problem into smaller non-overlapping subproblems, while a direct decomposition usually results in substantial overlapping subproblems; (ii) although a full parallelization over components is restricted since stem components have to be computed only after all of their branching components are finished, the unordered components can be computed simultaneously. For example, as in Figure 4b, , , and can be computed in parallel.

5.2 Reverse Graphs

Let be the reverse graph of , where . We show that the number of consistent subgraphs in equals that in .

Lemma 5.1.

If is a consistent subgraph of , is a consistent subgraph of .

Proof.

We prove this Lemma by contradiction. For and ,444We use and for ancestors and descendants of in ; and . if , then due to the consistency of . This contradicts . Therefore, the assumption is false and we have , . That is, is consistent. ∎

This Lemma demonstrates that all complementary white vertices in Figure 2 form consistent subgraphs in the reverse graph.

Theorem 5.1.

Given an ontology , .

Proof.

Given Lemma 5.1, we see that the mapping is a bijection between the two sets of consistent subgraphs. Therefore, the two sets are of equal cardinality. ∎

Theorem 5.1 permits graph reversal at any point during the algorithm depending on which of the graphs is more likely to terminate first. For example, we can always choose the one with fewer multi-parent vertices so as to greedily reduce the upper bound of recursive calls. It is worth noting that all the leaves become roots in the reverse graph. Therefore, in the final algorithm that incorporates both pruning and reversing modules, we generalize the algorithm to allow for on roots (branching vertices in the reverse sense) in order to ensure compatibility.

Having on a root indicates that all the ancestors of have been pruned out. For trees (after pruning), we have . With Lemma 4.1 and Theorem 5.1, we have

On the other hand, for any consistent subgraph containing , induces a consistent subgraph in and vice versa; thus,

Hence, these two subtotals sum to be the total count and Equation 2 remains unchanged. However, if a root with is selected to be the pivot, we have the following equation according to Theorem 5.1 and Equation 3,

whereas Equation 3 remains unchanged for non-root vertices.

5.3 Pivot Selection

As alluded to before, the selection of vertices used for partitioning has the potential to significantly change the computation time. It is therefore reasonable to devise a strategy for pivot selection. Besides a random selection of multi-parent vertices (mpv’s), which aims at directly converting DAGs into trees one step at a time, we also consider three other pivot heuristics. The first strategy is to pick a vertex with the maximum degree, with random selection in case of ties, because decomposing the graph according to such vertices may increase the chance of having either disconnected components or branching components. The second strategy selects the pivot so as to minimize over the two subproblems, where , , and are the number of edges, vertices, and roots in the two components. We refer to this quantity as “bound” since it is an upper bound of the number of mpv’s in the graph (see Supplementary Materials for the proof). Note that it is closely related to the cyclomatic number of the graph. Finally, the third strategy simulates a unit network flow for all vertices running in the direction from leaves to the roots and selects the “bottleneck” vertex; i.e., the one that maximizes the ratio of the flow in the vertex and the number of its descendants (see Supplementary Materials for this pivot selection algorithm). These strategies will be empirically compared in Section 6.

5.4 Hashing

It can occur during the recursive procedure that certain subgraphs require repeated enumeration. In Figure 3, for example, the subgraph -- is present in both subproblems shown in Figures 3b-c. Computing the count for this subgraph would emerge in the Figure 3b subproblem if the ensuing decomposition were based on vertex , although it would not emerge if the partitioning were based on vertex . Interestingly, the subgraph - would be counted twice in the Figure 3b subproblem; i.e., when both and are removed, and it would then appear one more time in the Figure 3c subproblem.

To avoid repeated enumeration, whenever a solution to a subproblem is obtained, the count for this subproblem is stored. Then, during the recursive calls, we first check if the result is already available before further calculation. To hash a result, we use the sorted IDs of all vertices in the subgraph as a key. Obviously, this key is unique because it corresponds to a vertex-induced subgraph of . For the pruned subgraph, we store the key of the subgraph along with the branching vertex. Whenever the ID of the branching vertex is used to generate a key, the stored key of the corresponding subgraph is appended to the vertex’s ID with parentheses around it.

Input : A directed acyclic graph
Output : The number of consistent subgraphs in .
1 Function 
2        ;
3        if  then
4               return ;
5              
6        end if
7       if  then // check the number of multi-parent vertices
8               reverse();
9              
10        end if
11       ;
12        foreach connected component  do
13               if  is a tree then
14                      ;
15                     
16               else
17                      ;
18                      foreach  in a reversed topological order do
19                             ;
20                             ;
21                            
22                      end foreach
23                     ;
24                      if  is a root then
25                             ;
26                            
27                      else
28                             ;
29                            
30                      end if
31                     
32               end if
33              ;
34              
35        end foreach
36       ;
37        return ;
38       
39 end
Algorithm 3 The advanced version of Algorithm 2 with optimization modules.

6 Experiments and Results

We empirically evaluate the enumeration procedure from Algorithm 3 and various practical speedups using randomly generated graphs. We then apply this algorithm to four biomedical ontologies to gain insight into the sizes of their concept annotation spaces.

6.1 Run-time Evaluation

We generated two sets of graphs to investigate the efficacy of our algorithm. Each set contained 1000 graphs with either 25 or 100 vertices. To construct each graph the vertices were added sequentially, with the proposed in-degree in-deg() of the -th vertex generated according to a Poisson distribution with parameter . This vertex then became a child of (in-deg(), ) previously generated vertices that were themselves selected uniformly randomly. The parameter was selected according to the prior for each new graph and kept constant until the graph was completed.

With these two sets of simulated graphs, we ran our algorithm with different modules and pivot selection strategies. In particular, we evaluate pivot selection based on (i) random selection of vertices, (ii) random selection of multi-parent vertices, (iii) the degree criterion, (iv) the bound criterion, and (v) the bottleneck criterion. For each pivoting strategy, we subsequently add the pruning component, then hashing, and finally the graph reversal. The criterion for graph reversal was the number of multi-parent vertices; i.e., a graph will be reversed at any point during the recursive process if the reversed graph contains fewer multi-parent vertices.

We report the average wall-time and average number of recursive calls over the two sets of 1000 graphs ( in Table 1; in Table 2). For the smaller graphs, we also ran a brute-force algorithm that was further convenient to empirically evaluate the correctness of our algorithm. We see that simpler schemes perform better on small graphs where the number of recursive calls per graph has not exceeded a few hundreds. On the other hand, the advanced techniques show tangible benefits on the larger graphs reducing the number of recursive calls and total computation time by orders of magnitude. It is possible to envision other variations that could result in further speedups; e.g., selecting multi-parent pivots with the highest degree. These refinements, however, were beyond the scope of this paper.

brute-force module random random mpv min. bound max. degree bottleneck
571ms 22.5ms (313) 5.3ms (39) 25.7ms (23) 1.2ms (28) 18.2ms (47)
21.1ms (97) 14.3ms (44) 26.4ms (25) 10.2ms (28) 25.3ms (44)
19.6ms (71) 14.4ms (39) 26.3ms (23) 10.2ms (26) 24.9ms (34)
19.2ms (67) 7.5ms (28) 25.5ms (23) 7.5ms (23) 23.9ms (31)
Table 1: Experiments with simulated graphs with vertices. Each field in the table summarizes the per-graph wall-time over a set of 1000 graphs as well as the per-graph number of recursive calls, except for the brute-force method. The columns represent pivot selection strategies: (i) random, (ii) random multi-parent vertex (mpv), (iii) minimum bound, (iv) maximum degree, and (v) bottleneck. The rows represent successive additions of practical modules for speedups: () basic approach from Algorithm 2, () pruning, () pruning and hashing, () pruning, hashing, and graph reversal.
module random random mpv min. bound max. degree bottleneck
*3,102s (119,745,876) 5.21s (52,954) 114s (25,416) 9.93s (101,342) 122s (526,925)
*241s (1,727,306) 8.98s (33,271) 3.93s (2,802) 1.10s (3,066) 4.28s (12,597)
*132s (387,910) 7.35s (14,913) 3.67s (1,111) 0.92s (1,107) 3.08s (2,052)
165s (508,521) 4.68s (9,721) 3.22s (1,103) 0.84s (1,079) 2.79s (2,133)
Table 2: Experiments with simulated graphs with vertices, with rows and columns identical to those in Table 1. The entries with an asterisk indicate that a sample of graphs was considered (instead of a full set of 1000) due to the long run-time. The brute-force algorithm was not considered as it was not feasible to compute the count for even a single graph.

6.2 Consistent Subgraphs in Biomedical Ontologies

We use 02/2017 versions of Gene Ontology (GO) and Human Phenotype Ontology (HPO) as the target ontologies and compute the number of consistent subgraphs in each of them. The algorithm is applied to each of the three domains of GO [1]: (i) molecular function ontology (MFO; 10,789 terms) (ii) biological process ontology (BPO; 29,575 terms) and (iii) cellular component ontology (CCO; 4,085 terms). Together with HPO (12,167 terms), these four ontologies are widely used in annotating functional terms of gene products [17, 11]. We further define the annotation level for each term in the ontology to be the length of the longest path to the root. Starting from the root term, we add more specific terms level-by-level so as to understand how the potential annotation space grows with increased granularity of functional concepts.

In addition to level-wise full ontologies, we also investigate the “used” ontologies in which each term was retained only if at least one protein in the UniProt-GOA [9] and HPO [19] databases has been confidently assigned that term (confident annotations include all experimental evidence codes as well as “traceable author statement” and “inferred by curators”). Protein function annotations were extracted from the 02/13/2017 release of the UniProt-GOA database, which contains 64,362 proteins with confident MFO annotations, 84,413 proteins with BPO annotations and 79,630 proteins with CCO annotations. HPO annotations were extracted from the 02/24/2017 release of the HPO database where 6,411 genes with confident annotations were extracted.

Figure 5 shows the completed counts for both full and used level-wise ontologies. For each ontology, we additionally compute the lower bound (generally the larger of and , where is the number of roots) and estimate the upper bound (we convert a graph into a forest by keeping only one randomly selected incoming edge for each multi-parent vertex and then call ).

Figure 5: Number of consistent subgraphs in level-wise GO and HPO. In each panel, solid black lines mark the exact counts for “full” subgraphs and grey dotted lines mark the exact counts for “used” subgraphs. Colored bars indicate the estimated upper/lower bounds of the actual counts. The exact integer counts are available upon request.

Although we were not surprised by the astronomical sizes of concept annotation spaces, it was interesting to quantify them whenever feasible as well as to observe an increasing difference between lower and upper bounds (in the 1000s of orders of magnitude) with the level of the ontology. We also find it interesting that a large number of ontological terms have never been used (see Supplementary Materials). These outcomes raise questions regarding the predictability of ontological annotations as most modern algorithms are asked to provide accurate deep annotations to be deemed useful. However, annotation spaces become exceedingly large almost instantaneously with the depth of the ontology, which presents an immense computational and statistical challenge for any prediction algorithm. We therefore believe that the balance between ontology size/complexity and term granularity might become an important topic for future discussions.

6.3 Entropy of Concept Annotation Spaces

The ability to enumerate subgraphs in relatively large ontologies presents an opportunity to contrast the space of actual ontological annotations in biological databases with the space of possible ontological annotations. To investigate this, we first computed the entropy of actual annotations at different levels in the ontology,

where is the truncated ontology as in Section 6.2, corresponds to a distinct consistent subgraph annotation observed at that level and is the probability that a protein is assigned annotation . We first enumerated all observed subgraphs from the UniProt-GOA or HPO database truncated to a particular level, calculated their relative frequencies, and then plugged these relative frequencies into the entropy formula above. On the other hand, the maximum entropy was computed as by assuming equal probability for every possible consistent subgraphs.

Figure 6: Ratio of entropies in the four ontologies. Circles with solid lines show the ratio of observed entropy to the maximum entropy. Dotted lines correspond to the estimated ratios as the average of the two ratios calculated by lower/upper bound of the counts. The error bars suggest a possible placement for the actual ratio.

Figure 6 shows the ratio between the two quantities for levels greater than 1, suggesting that the world of protein functions, despite great diversity, has low entropy relative to the possible maximum. Although the currently observed functional annotations are incomplete, noisy and biased [23, 24, 10], this suggests considerable departure from the uniform distribution and implies that an extensive number of possible consistent subgraphs have not been used.

7 Related Work

There exists a body of literature in enumerative combinatorics related to our work. One of the most relevant problems is the enumeration of directed acyclic graphs with distinct (labeled) nodes [20]. The resulting count reflects the size of the structure space of Bayesian networks with random variables and, surprisingly, also corresponds to the number of matrices in with all eigenvalues real and positive [13]. The number of labeled directed acyclic graphs with nodes does not have a closed-form solution and is instead available as the A003024 sequence in the On-Line Encyclopedia of Integer Sequences (OEIS); https://oeis.org/A003024. The construction was originally proposed by Robinson [20] and was further investigated by others [26, 21, 6].

The results on rooted labeled trees include both the enumeration of possible number of trees and also the enumeration of subtrees for a given tree. There are labeled rooted trees with nodes [7] that provide the integer sequence A000169 in OEIS; https://oeis.org/A000169. The expansion to forests gives using Cayley’s formula [3], as a single root can be added to connect a forest of unrooted labeled trees into a rooted labeled tree. The recurrence for the number of subtrees of a given tree was proposed by Ruskey [22]; see Algorithm 1. The generalization to weighted subtrees was given by Yan and Yeh [30]. Both algorithms are linear in assuming constant time addition and multiplication.

Our work also relates to the research in ontology quality assurance. These efforts typically include the analysis of irregularities and redundancy in concept descriptors and graph structure [2, 27, 29]. Our work, primarily the software we developed, contributes to this area by facilitating the analysis of the annotation space.

8 Conclusions

This work presents a practical algorithm for enumerating consistent subgraphs of directed acyclic graphs. Although this study largely addresses an intellectual challenge of efficient subgraph enumeration, it might have practical utility for the studies of annotation spaces in the biomedical sciences and beyond; e.g., in text mining [8] and computer vision [15]. As ontologies are easy to grow and hard to manually interrogate, we provide a practical tool that can give insights into the complexity of concept annotation tasks [17, 11]. As such, it may serve as a guide to ontology developers.

Acknowledgements

This work has been supported by the National Science Foundation grant DBI-1458477 and the Indiana University Precision Health Initiative.

References

  • [1] M. Ashburner, C. A. Ball, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25(1):25–29, 2000.
  • [2] O. Bodenreider. Strength in numbers: exploring redundancy in hierarchical relations across biomedical terminologies. AMIA Annu Symp Proc, pages 101–105, 2003.
  • [3] A. Cayley. A theorem on trees. Quart J Math, 23:376–378, 1889.
  • [4] W. T. Clark and P. Radivojac. Information-theoretic evaluation of predicted ontological annotations. Bioinformatics, 29(13):i53–i61, 2013.
  • [5] I. Friedberg and P. Radivojac. Community-wide evaluation of computational function prediction. Methods Mol Biol, 1446:133–146, 2017.
  • [6] I. M. Gessel. Counting acyclic digraphs by sources and sinks. Discrete Math, 160:253–258, 1996.
  • [7] J. L. Gross and J. Yellen. Handbook of graph theory. CRC Press, Boca Raton, Florida, U.S.A., 2004.
  • [8] M. Grosshans, C. Sawade, et al. Joint prediction of topics in a URL hierarchy. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, ECML/PKDD 2014, pages 514–529, 2014.
  • [9] R. P. Huntley, T. Sawford, et al. The GOA database: Gene Ontology annotation updates for 2015. Nucleic Acids Res, 43(Database issue):D1057–D1063, 2015.
  • [10] Y. Jiang, W. T. Clark, et al. The impact of incomplete knowledge on the evaluation of protein function prediction: a structured-output learning perspective. Bioinformatics, 30(17):i609–i616, 2014.
  • [11] Y. Jiang, T. R. Oron, et al. An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol, 17(1):184, 2016.
  • [12] T. Joachims, T. Finley, and C.-N. J. Yu. Cutting-plane training of structural SVMs. Mach Learn, 77(1):27–59, 2009.
  • [13] B. D. McKay, F. E. Oggier, et al. Acyclic digraphs and eigenvalues of (0,1)-matrices. J Integer Seq, 7:04.3.3, 2004.
  • [14] Y. Moreau and L. C. Tranchevent. Computational tools for prioritizing candidate genes: boosting disease gene discovery. Nat Rev Genet, 13(8):523–536, 2012.
  • [15] Y. Movshovitz-Attias, Q. Yu, et al. Ontological supervision for fine grained classification of street view storefronts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, pages 1693–1702, 2015.
  • [16] S. Poux and P. Gaudet. Best practices in manual annotation with the Gene Ontology. Methods Mol Biol, 1446:41–54, 2017.
  • [17] P. Radivojac, W. T. Clark, et al. A large-scale evaluation of computational protein function prediction. Nat Methods, 10(3):221–227, 2013.
  • [18] P. N. Robinson and S. Bauer. Introduction to bio-ontologies. CRC Press, Boca Raton, Florida, U.S.A., 2011.
  • [19] P. N. Robinson and S. Mundlos. The human phenotype ontology. Clin Genet, 77(6):525–534, 2010.
  • [20] R. W. Robinson. Counting labeled acyclic digraphs. In Proceedings of the 3rd Ann Arbor Conference on Graph Theory, pages 239–273. Academic Press, 1971.
  • [21] V. I. Rodionov. On the number of labeled acyclic digraphs. Discrete Math, 105:319–321, 1992.
  • [22] F. Ruskey. Listing and counting subtrees of a tree. SIAM J Comput, 10(1):141–151, 1981.
  • [23] A. M. Schnoes, S. D. Brown, et al. Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput Biol, 5(12):e1000605, 2009.
  • [24] A. M. Schnoes, D. C. Ream, et al. Biases in the experimental annotations of protein function and their effect on our understanding of protein function space. PLoS Comput Biol, 9(5):e1003063, 2013.
  • [25] A. Sokolov and A. Ben-Hur. Hierarchical classification of gene ontology terms using the GOstruct method. J Bioinform Comput Biol, 8(2):357–376, 2010.
  • [26] R. P. Stanley. Acyclic orientations of graphs. Discrete Math, 5:171–178, 1973.
  • [27] K. Verspoor, D. Dvorkin, et al. Ontology quality assurance through analysis of term transformations. Bioinformatics, 25(12):i77–i84, 2009.
  • [28] M. Vihinen. Variation Ontology for annotation of variation effects and mechanisms. Genome Res, 24(2):356–364, 2014.
  • [29] G. Xing, G. Q. Zhang, and L. Cui. FEDRR: fast, exhaustive detection of redundant hierarchical relations for quality improvement of large biomedical ontologies. BioData Min, 9:31, 2016.
  • [30] W. Yan and Y. N. Yeh. Enumeration of subtrees of trees. Theor Comput Sci, 369(1-3):256–268, 2006.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
""
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
   
Add comment
Cancel
Loading ...
67513
This is a comment super asjknd jkasnjk adsnkj
Upvote
Downvote
""
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters
Submit
Cancel

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test
Test description