# Classification in biological networks with hypergraphlet kernels

## Abstract

Biological and cellular systems are often modeled as graphs in which vertices represent objects of interest (genes, proteins, drugs) and edges represent relational ties among these objects (binds-to, interacts-with, regulates). This approach has been highly successful owing to the theory, methodology and software that support analysis and learning on graphs. Graphs, however, often suffer from information loss when modeling physical systems due to their inability to accurately represent multiobject relationships. Hypergraphs, a generalization of graphs, provide a framework to mitigate information loss and unify disparate graph-based methodologies. In this paper, we present a hypergraph-based approach for modeling physical systems and formulate vertex classification, edge classification and link prediction problems on (hyper)graphs as instances of vertex classification on (extended, dual) hypergraphs in a semi-supervised setting. We introduce a novel kernel method on vertex- and edge-labeled (colored) hypergraphs for analysis and learning. The method is based on exact and inexact (via hypergraph edit distances) enumeration of small simple hypergraphs, referred to as hypergraphlets, rooted at a vertex of interest. We extensively evaluate this method and show its potential use in a positive-unlabeled setting to estimate the number of missing and false positive links in protein-protein interaction networks.

## 1 Introduction

Graphs provide a mathematical structure for describing relationships between objects in a system. Owing to their intuitive representation, well-understood theoretical properties, the wealth of algorithmic methodology and available code base, graphs have also become a major framework for modeling biological systems. Protein-protein interaction networks, protein 3D structures, drug-target interaction networks, metabolic networks and gene regulatory networks are some of the major representations of biological systems. Unfortunately, molecular and cellular systems are only partially observable and may contain a significant amount of noise due to their inherent stochastic nature as well as the limitations of both low-throughput and high-throughput experimental techniques. This highlights the need for the development and application of computational approaches for predictive modeling (e.g., inferring novel interactions) and identifying interesting patterns in such data.

Learning on graphs can generally be seen as supervised or unsupervised. Under a supervised setting, typical tasks involve graph classification, i.e., the assignment of class labels to entire graphs [55]; vertex or edge classification, i.e., the assignment of class labels to vertices or edges in a single graph [30]; or link prediction, i.e., the prediction of the existence of edges in graphs [34]. Alternatively, frequent subgraph mining [27], motif finding [38], clustering [1], and community detection [17] are traditional unsupervised approaches. Regardless of the category, the development of techniques that capture local/global network structure, measure graph similarity and incorporate domain-specific knowledge in a principled manner lies at the core of all these problems.

The focus of this study is on classification problems across various biological networks. A straightforward approach to this problem is the use of topological and other descriptors (e.g., vertex degree, clustering coefficient, betweenness centrality) that summarize graph neighborhoods. These descriptors straightforwardly lead to vector-space representations of vertices or edges in the graph, after which standard machine learning algorithms can be applied to learn a target function [11, 62]. Another approach involves the use of kernel functions on graphs [58]. Kernels are mappings of pairs of objects from an input space to an output space with special properties, such as symmetry and positive semi-definiteness, that lead to efficient learning. Graph kernels often exploit similar ideas as traditional vector-space approaches. Finally, classification on graphs can be approached using probabilistic graphical models such as Markov Random Fields [30] and related label-propagation [66] or flow-based [40] methods. These “global” formulations are generally well adjusted to learning smooth functions over neighboring nodes.

Despite the success and wide adoption of these methods in machine learning and computational biology, it is well-understood that graph representations suffer from information loss since every edge can only encode pairwise relationships [29]. A protein complex, for instance, cannot be distinguished from a set of proteins that interact only pairwise. Such disambiguation, however, is important in order to understand the biological activity of these molecules. Hypergraphs, a generalization of graphs, naturally capture these higher-order relationships [5]. As we show later, they also provide a representation that can be used to unify several conventional classification problems on (hyper)graphs as a single vertex classification approach on hypergraphs.

In this paper, we present and evaluate a kernel-based framework for the problems of vertex classification, edge classification and link prediction in graphs and hypergraphs. We first use the concepts of hypergraph duality to demonstrate that all such classification problems can be unified through the use of hypergraphs. We then describe the development of edit-distance hypergraphlet kernels for vertex classification in hypergraphs and combine them with support vector machines into a semi-supervised predictive methodology. Finally, we use sixteen biological network data sets, eleven assembled specifically for this work, to provide evidence that the proposed approaches compare favorably to the previously established methods.

## 2 Background

### 2.1 Graphs and hypergraphs

Graphs. A graph is a pair $G = (V, E)$, where $V$ is a set of vertices (nodes) and $E \subseteq V \times V$ is a set of edges. In a vertex-labeled graph, a labeling function is defined as $\ell : V \to \Sigma$, where $\Sigma$ is a finite alphabet. Similarly, in an edge-labeled graph, another labeling function is given as $\lambda : E \to \Xi$, where $\Xi$ is also a finite set. A rooted graph is a graph together with one distinguished vertex called the root. We denote such graphs as $G(v)$, where $v$ is the root. A neighborhood graph $G_n(v)$ of a vertex $v$ is a rooted graph constructed from $G$ such that all nodes at distance at least $n$ from $v$ (and the corresponding edges) are removed.

In this work we focus on undirected graphs (the order of the vertices in each edge can be ignored) that are also simple (without self-loops). Additionally, for simplicity of presentation, we ignore weighted graphs; i.e., graphs where a non-negative number is associated with each vertex or edge. Generalization of our approach and terminology to directed and weighted graphs is straightforward.

A walk of length $n$ in a graph $G$ is a sequence of nodes $v_0, v_1, \ldots, v_n$ such that $(v_i, v_{i+1}) \in E$ for $i \in \{0, 1, \ldots, n-1\}$. If $v_0 = v_n$, the walk is called a cycle of length $n$. A path in $G$ is a walk in which all nodes are distinct. A connected graph is a graph where there is a path between any two nodes.

Hypergraphs. A hypergraph is a pair $H = (V, E)$, where $V$ is the vertex set as previously defined and $E$ is a family of non-empty subsets of $V$ called hyperedges. As in the case of graphs, one can define vertex-labeled, edge-labeled, rooted, and neighborhood hypergraphs. A hyperedge $e$ is said to be incident with a vertex $v$ if $v \in e$, and two vertices are called adjacent if there is a hyperedge that contains both of them. The neighbors of a vertex $v$ in a hypergraph are the vertices adjacent to $v$. Two hyperedges are said to be adjacent if their intersection is non-empty. Finally, the degree of a vertex $v$ in a hypergraph is given by $d(v) = |\{e \in E : v \in e\}|$, whereas the degree of a hyperedge $e$ is defined as its cardinality; that is, $d(e) = |e|$.
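To make the incidence, adjacency and degree definitions concrete, here is a minimal sketch (hypothetical code, not from the paper) that represents a hypergraph as a family of vertex subsets:

```python
# A hypergraph as a vertex set plus a family of non-empty vertex subsets.
V = {"a", "b", "c", "d"}
E = [frozenset({"a", "b", "c"}), frozenset({"c", "d"}), frozenset({"a", "d"})]

def vertex_degree(v, hyperedges):
    # d(v): number of hyperedges incident with v.
    return sum(1 for e in hyperedges if v in e)

def hyperedge_degree(e):
    # d(e): cardinality of the hyperedge.
    return len(e)

def neighbors(v, hyperedges):
    # Vertices adjacent to v: co-members of some hyperedge containing v.
    return set().union(*(e for e in hyperedges if v in e)) - {v}

print(vertex_degree("a", E))        # 2
print(hyperedge_degree(E[0]))       # 3
print(sorted(neighbors("c", E)))    # ['a', 'b', 'd']
```

Note that, unlike in a simple graph, `neighbors("c", E)` pools vertices from two different hyperedges of different cardinalities.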

A walk of length $n$ in a hypergraph is a sequence of vertices and hyperedges $v_0, e_1, v_1, e_2, \ldots, e_n, v_n$ such that $v_{i-1}, v_i \in e_i$ for each $i \in \{1, \ldots, n\}$. If $v_0 = v_n$, the walk is called a cycle of length $n$. A path in a hypergraph is a walk in which all vertices and hyperedges are distinct. A connected hypergraph is a hypergraph where there exists a path between any two nodes.

Isomorphism. Consider two graphs $G = (V, E)$ and $G' = (V', E')$. We say that $G$ and $G'$ are isomorphic, denoted as $G \simeq G'$, if there exists a bijection $f : V \to V'$ such that $(u, v) \in E$ if and only if $(f(u), f(v)) \in E'$ for all $u, v \in V$. If $H = (V, E)$ and $H' = (V', E')$ are hypergraphs, an isomorphism is defined as a pair of interrelated bijections $f : V \to V'$ and $g : E \to E'$ such that $v \in e$ if and only if $f(v) \in g(e)$ for all vertices $v$ and hyperedges $e$. Isomorphic graphs (hypergraphs) are structurally identical. An automorphism is an isomorphism of a graph (hypergraph) to itself.

Edit distance. Consider two vertex- and hyperedge-labeled hypergraphs $H$ and $H'$. The edit distance between these hypergraphs corresponds to the minimum number of edit operations necessary to transform $H$ into $H'$, where edit operations are defined as insertion/deletion of vertices/hyperedges and substitutions of vertex and hyperedge labels. Any sequence of edit operations that transforms $H$ into $H'$ is referred to as an edit path; hence, the hypergraph edit distance between $H$ and $H'$ corresponds to the length of the shortest edit path between them. This concept can be generalized to the case where each edit operation is assigned a cost; the hypergraph edit distance then corresponds to the cost of the minimum-cost edit path.

### 2.2 Hypergraph duality

Let $H = (V, E)$ be a hypergraph, where $V = \{v_1, \ldots, v_n\}$ and $E = \{e_1, \ldots, e_m\}$. The dual hypergraph of $H$, denoted as $H^* = (V^*, E^*)$, is obtained by constructing the set of vertices as $V^* = \{e_1^*, \ldots, e_m^*\}$, with one vertex $e_j^*$ per hyperedge $e_j$ of $H$, and the set of hyperedges as $E^* = \{v_1^*, \ldots, v_n^*\}$ such that $v_i^* = \{e_j^* : v_i \in e_j\}$. Figure 1A-B shows two examples of a hypergraph $H$ and its dual hypergraph representation $H^*$. Observe that the hyperedges of the original hypergraph $H$ are the vertices of the dual hypergraph $H^*$, whereas the hyperedges of $H^*$ are constructed using the hyperedges of $H$ that are incident with the respective vertices.
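The duality construction is mechanical enough to sketch in a few lines. The following hypothetical helper builds the dual by indexing hyperedges and grouping, per vertex, the hyperedges incident with it:

```python
def dual_hypergraph(V, E):
    """Given vertices V and hyperedges E (a list of sets), return the dual:
    one dual vertex per original hyperedge, and one dual hyperedge per
    original vertex, collecting the hyperedges incident with that vertex."""
    V_star = list(range(len(E)))  # dual vertex j stands for hyperedge E[j]
    E_star = {v: frozenset(j for j, e in enumerate(E) if v in e) for v in V}
    return V_star, E_star

V = {"a", "b", "c"}
E = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}]
V_star, E_star = dual_hypergraph(V, E)
print(sorted(E_star["b"]))  # all edges containing "b" -> [0, 1, 2]
```

Applying the construction twice recovers a hypergraph isomorphic to the original, which is what makes (hyper)edge classification on $H$ interchangeable with vertex classification on $H^*$.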

### 2.3 Classification on hypergraphs

We are interested in binary classification on hypergraphs. The following paragraphs briefly define three distinct classification problems, formulated here so as to naturally lead to the methodology proposed in the next section.

Vertex classification. Given is a set of rooted hypergraphs $\{H(v) : v \in V\}$, where each $H(v)$ corresponds to the same, possibly disconnected, hypergraph $H$ rooted at a different vertex of interest $v$. Here, one aims to learn a classifier function $f : V \to \{0, 1\}$ using a labeled training set $\{(H(v_i), y_i)\}_i$, where $y_i \in \{0, 1\}$, as a means of assigning class labels to each unlabeled vertex in $H$. A number of classical problems in computational biology map straightforwardly to vertex classification; e.g., protein function prediction [50], disease gene prioritization [39], and so on.

Hyperedge classification. Given a possibly disconnected hypergraph $H = (V, E)$, the objective is to learn a discriminant function $f : E \to \{0, 1\}$ from a labeled training set $\{(e_i, y_i)\}_i$, where $y_i \in \{0, 1\}$, and infer class annotations for every unlabeled hyperedge in $H$. An example of edge classification is the prediction of types of macromolecular interactions such as positive vs. negative regulation.

Link prediction. Let $H = (V, E)$ be a hypergraph with some missing hyperedges and let $\bar{E}$ be all non-existent hyperedges in $H$; i.e., $\bar{E} = \mathcal{E} \setminus E$, where $\mathcal{E}$ represents all possible hyperedges over $V$. The goal is to learn a target function $f : \bar{E} \to \{0, 1\}$ and infer the existence of all missing hyperedges. Examples of link prediction include predicting protein-protein interactions, predicting drug-target interactions, and so on.

### 2.4 Positive-unlabeled learning

A number of prediction problems in computational biology can be considered within a semi-supervised framework, where a set of labeled and a set of unlabeled examples are used to construct classifiers that discriminate between positive and negative examples. A special category of semi-supervised learning occurs when labeled data contain only positive examples; i.e., where the negative examples are either unavailable or ignored; say, if the set of available negatives is small or biased. Such problems are generally referred to as learning from positive and unlabeled data or positive-unlabeled learning [14]. Many prediction problems in molecular biology belong to the open world category; i.e., due to various experimental reasons, the absence of evidence of class labels is not the evidence of absence. Such problems lend themselves naturally to the positive-unlabeled setting.

Research in machine learning has recently established tight connections between traditional supervised learning and (non-traditional) positive-unlabeled learning. Under mild conditions, a classifier that optimizes the ranking performance; e.g., area under the ROC curve [16], in the non-traditional setting has been shown to also optimize the performance in the traditional setting [15, 6, 37]. Similar relationships have been established in approximating posterior distributions [24, 26] as well as in recovering the true performance accuracy in the traditional setting for a classifier evaluated in a non-traditional setting [25]. The latter two problems require estimation of class priors; i.e., the fractions of positive and negative examples in (representative) unlabeled data [24, 26, 48].

## 3 Methods

### 3.1 Problem formulation

We consider binary classification problems on graphs and hypergraphs and propose to unify all such learning problems through semi-supervised vertex classification on hypergraphs. First, vertex classification falls trivially into this framework. Second, the problems of edge classification in graphs and hyperedge classification in hypergraphs are equivalent to the problem of vertex classification on dual hypergraphs. As discussed in Section 2.2, both graphs and hypergraphs give rise to dual hypergraph representations and, thus, (hyper)edge classification on a (hyper)graph $H$ straightforwardly translates into vertex classification on its dual hypergraph $H^*$. We note here that vertices with a degree of one in $H$ give rise to self-loops in the dual hypergraph $H^*$. To account for them, we add one dummy node per self-loop with the same vertex label as the original vertex and connect them with an appropriately labeled edge. Third, one can similarly see link prediction as vertex classification on dual hypergraphs, where the set of existing links is treated as positive data, the set of known non-existing links is treated as negative data, and the remaining set of missing links is treated as unlabeled data. This formulation further requires an extension of dual hypergraph representations as follows. Consider a particular negative or missing link $e$ in the original graph $G$ with its dual hypergraph $G^*$ (Fig. 1C). To make a prediction on this edge, we must first introduce a new vertex $e^*$ in the dual hypergraph as well as modify those hyperedges in $G^*$ that correspond to the endpoint vertices of $e$ (Fig. 1C). We denote this extended hypergraph as $G^*_e$. It now easily follows that the sets of negative and unlabeled examples can be created by considering a collection of extended hypergraphs $G^*_e$, one at a time, for all non-existing links or a subset thereof.
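The extension step for link prediction can be sketched as follows (hypothetical code; names are illustrative): given the dual of $G$ and a candidate non-edge $\{u, v\}$, we add one new dual vertex for the candidate and enlarge the dual hyperedges of $u$ and $v$ to include it:

```python
def extend_dual(E_star, u, v, new_vertex):
    """E_star maps each original vertex to the set of dual vertices (edge ids)
    incident with it. Return a copy in which a candidate edge {u, v},
    represented by new_vertex, has been inserted."""
    extended = {w: set(ids) for w, ids in E_star.items()}  # deep-ish copy
    extended[u].add(new_vertex)
    extended[v].add(new_vertex)
    return extended

# Dual of a toy graph with edges e0 = {a, b} and e1 = {b, c}.
E_star = {"a": {0}, "b": {0, 1}, "c": {1}}

# Extend with the candidate (non-existing) link {a, c}.
ext = extend_dual(E_star, "a", "c", "e_ac")
print(ext["a"], ext["c"])
```

Classifying the new dual vertex `e_ac` in the extended hypergraph then answers whether the candidate link should exist.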

Since most graph data in biological networks lack large sets of representative negative examples, we approach vertex classification, (hyper)edge classification and link prediction as instances of vertex classification on (extended, dual) hypergraphs in a positive-unlabeled setting. We believe this is a novel and useful attempt at generalizing three distinct graph classification problems in a common kernel-based semi-supervised setting. The following sections introduce hypergraphlet kernels that are the core of our classification approach.

### 3.2 Hypergraphlets

Hypergraphlets. Inspired by graphlets [44, 43], we define hypergraphlets as small, simple, connected, rooted hypergraphs. A hypergraphlet with $n$ vertices is called an $n$-hypergraphlet, and the $i$-th hypergraphlet of order $n$ is denoted as $h_{n,i}$. We consider hypergraphlets up to isomorphism and will refer to these isomorphisms as root- and label-preserving isomorphisms when hypergraphs are rooted and labeled. Figure 2 displays all non-isomorphic unlabeled $n$-hypergraphlets with up to three vertices. There is only one hypergraphlet of order 1 (Fig. 2A) and one hypergraphlet of order 2 (Fig. 2B). On the other hand, there are nine hypergraphlets of order 3 (Fig. 2C), and the number of hypergraphlets of order 4 (not shown) is much larger still. We refer to all these hypergraphlets as base hypergraphlets since they correspond to the unlabeled case; i.e., $|\Sigma| = |\Xi| = 1$.

Consider now a vertex- and hyperedge-labeled (or fully labeled, for short) hypergraphlet with $n$ vertices and $m$ hyperedges, where $\Sigma$ and $\Xi$ denote the vertex-label and hyperedge-label alphabets, respectively. If $|\Sigma| > 1$ and/or $|\Xi| > 1$, then automorphic structures with respect to the same base hypergraphlet may exist; hence, the number of fully labeled hypergraphlets per base structure is generally smaller than $|\Sigma|^n \cdot |\Xi|^m$. For example, if one only considers vertex-labeled 3-hypergraphlets, then there are $|\Sigma|^3$ vertex-labeled hypergraphlets corresponding to each asymmetric base hypergraphlet, but fewer corresponding to each base hypergraphlet with non-trivial symmetries; see Table 5. This is a result of symmetries in the base hypergraphlets that give rise to automorphisms among vertex-labeled structures. Similarly, if $|\Xi| > 1$, then new symmetries may exist with respect to the base hypergraphlets that give rise to different automorphisms among hyperedge-labeled structures. In the Appendix, we provide a more detailed discussion of these symmetries. The relevance of these symmetries and enumeration steps relates to the dimensionality of the Hilbert space in which the prediction is carried out.

### 3.3 Hypergraphlet kernels

Motivated by the case for graphs [52, 56, 35], we introduce hypergraphlet kernels. Let $H = (V, E)$ be a fully labeled hypergraph, where $\ell : V \to \Sigma$ is a vertex-labeling function and $\lambda : E \to \Xi$ is a hyperedge-labeling function. The vertex- and hyperedge-labeled $n$-hypergraphlet count vector for any vertex $v \in V$ is defined as

$$\phi_n(v) = \left(\varphi_{n_1}(v), \varphi_{n_2}(v), \ldots, \varphi_{n_{\kappa(n,\Sigma,\Xi)}}(v)\right), \tag{1}$$

where $\varphi_{n_i}(v)$ is the count of the $i$-th fully labeled $n$-hypergraphlet rooted at $v$ and $\kappa(n, \Sigma, \Xi)$ is the total number of vertex- and hyperedge-labeled $n$-hypergraphlets. A kernel function between the $n$-hypergraphlet counts for vertices $u$ and $v$ is defined as an inner product between $\phi_n(u)$ and $\phi_n(v)$; i.e.,

$$k_n(u, v) = \langle \phi_n(u), \phi_n(v) \rangle. \tag{2}$$

The hypergraphlet kernel function incorporating all hypergraphlets up to size $N$ is given by

$$k(u, v) = \sum_{n=1}^{N} k_n(u, v), \tag{3}$$

where $N$ is a small integer. In this work we keep $N$ small due to the exponential growth of the number of base hypergraphlets.
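The count-and-inner-product structure of Equations (1)-(3) can be illustrated with a deliberately simplified sketch (hypothetical code): it enumerates only hypergraphlets of orders 1 and 2, identifying each by a canonical label tuple, whereas the actual method enumerates larger rooted sub-hypergraphs up to root- and label-preserving isomorphism.

```python
from collections import Counter

def count_hypergraphlets_up_to_2(v, E, vlabel, elabel):
    """Toy count vector phi(v): the order-1 hypergraphlet is just the root's
    label; each order-2 hypergraphlet is identified by the canonical triple
    (root label, hyperedge label, neighbor label)."""
    phi = Counter()
    phi[("h1", vlabel[v])] += 1
    for ei, e in enumerate(E):
        if v in e:
            for u in e:
                if u != v:
                    phi[("h2", vlabel[v], elabel[ei], vlabel[u])] += 1
    return phi

def kernel(phi_u, phi_v):
    # Sparse inner product of two count vectors, as in Equation (2).
    return sum(c * phi_v.get(key, 0) for key, c in phi_u.items())

vlabel = {"a": "A", "b": "B", "c": "A"}
E = [{"a", "b"}, {"a", "b", "c"}]
elabel = ["x", "y"]
phi_a = count_hypergraphlets_up_to_2("a", E, vlabel, elabel)
phi_c = count_hypergraphlets_up_to_2("c", E, vlabel, elabel)
print(kernel(phi_a, phi_c))  # 3
```

Summing such per-order kernels over $n = 1, \ldots, N$ gives Equation (3).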

### 3.4 Edit-distance hypergraphlet kernels

Consider a fully labeled hypergraph $H = (V, E)$. Given a vertex $v \in V$, we define the vector of counts for a $\tau$-generalized edit-distance hypergraphlet representation as

$$\phi_{(n,\tau)}(v) = \left(\psi_{(n_1,\tau)}(v), \psi_{(n_2,\tau)}(v), \ldots, \psi_{(n_{\kappa(n,\Sigma,\Xi)},\tau)}(v)\right), \tag{4}$$

where

$$\psi_{(n_i,\tau)}(v) = \sum_{n_j \in E(n_i,\tau)} c(n_i, n_j) \cdot \varphi_{n_j}(v). \tag{5}$$

Here, $E(n_i, \tau)$ is the set of all $n$-hypergraphlets $n_j$ for which there exists an edit path of total cost at most $\tau$ that transforms $n_j$ into $n_i$, and $\tau$ is a user-defined constant. In words, the counts for each hypergraphlet $n_i$ are updated by also counting all other hypergraphlets that are in the edit-distance vicinity of $n_i$. The function $c(n_i, n_j)$ can be used to adjust the weights of these pseudocounts. We set $c(n_i, n_j) = 1$ for all $n_i$ and $n_j$, and the cost of all edit operations was also set to 1. This restricts $\tau$ to nonnegative integers.
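The pseudocount update of Equation (5) amounts to a smoothing step over the count vector. A minimal sketch (hypothetical code), with unit costs $c(\cdot, \cdot) = 1$ and an externally supplied map from each hypergraphlet to its edit-distance ball $E(n_i, \tau)$:

```python
from collections import Counter

def edit_distance_counts(phi, edit_ball):
    """phi: observed hypergraphlet counts at a vertex (Counter).
    edit_ball: maps hypergraphlet id n_i to the set E(n_i, tau) of ids
    reachable by an edit path of total cost <= tau (including n_i itself).
    Returns psi with unit-weight pseudocounts, as in Equation (5)."""
    psi = Counter()
    for n_i, ball in edit_ball.items():
        psi[n_i] = sum(phi.get(n_j, 0) for n_j in ball)
    return psi

# Toy example: g1 observed 3 times, g2 once, g3 never; g1 and g2 are
# within edit cost tau of each other, g3 is within tau of g2 only.
phi = Counter({"g1": 3, "g2": 1})
edit_ball = {"g1": {"g1", "g2"}, "g2": {"g2", "g1"}, "g3": {"g3", "g2"}}
psi = edit_distance_counts(phi, edit_ball)
print(psi["g1"], psi["g2"], psi["g3"])  # 4 4 1
```

Note how `g3`, never observed directly, still receives a nonzero pseudocount from its edit-distance neighbor `g2`; this is what makes the representation robust to small labeling or structural noise.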

The $(n, \tau)$ edit-distance hypergraphlet kernel between vertices $u$ and $v$ can be computed as an inner product between the respective count vectors $\phi_{(n,\tau)}(u)$ and $\phi_{(n,\tau)}(v)$; i.e.,

$$k_{(n,\tau)}(u, v) = \langle \phi_{(n,\tau)}(u), \phi_{(n,\tau)}(v) \rangle. \tag{6}$$

Finally, the edit-distance hypergraphlet kernel function for edit cost $\tau$ is given as

$$k_\tau(u, v) = \sum_{n=1}^{N} k_{(n,\tau)}(u, v). \tag{7}$$

The edit operations considered here incorporate substitutions of vertex labels, substitutions of hyperedge labels, and insertions/deletions (indels) of hyperedges. Given these edit operations, we also define three subclasses of edit-distance hypergraphlet kernels, referred to as the vertex label-substitution, hyperedge label-substitution and hyperedge-indel kernels.

Although the functions from Equations (2) and (6) are defined as inner products, other formulations such as radial basis functions can be similarly considered [51]. We also note that the combined kernels from Equations (3) and (7) can be generalized beyond linear combinations [51]. For simplicity, however, we only explore equal-weight linear combinations and normalize the functions from Equations (2) and (6) using a cosine transformation.
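The cosine transformation mentioned above is the standard kernel normalization; a minimal sketch (hypothetical code):

```python
import math

def cosine_normalize(k, u, v):
    """Normalize a kernel value: k'(u, v) = k(u, v) / sqrt(k(u, u) * k(v, v)),
    so that every object has unit self-similarity."""
    denom = math.sqrt(k(u, u) * k(v, v))
    return k(u, v) / denom if denom > 0 else 0.0

# Toy kernel: dot product of small count vectors.
vecs = {"u": [1, 2, 0], "v": [2, 4, 0], "w": [0, 0, 1]}
dot = lambda a, b: sum(x * y for x, y in zip(vecs[a], vecs[b]))

print(cosine_normalize(dot, "u", "v"))  # 1.0 (parallel count vectors)
print(cosine_normalize(dot, "u", "w"))  # 0.0 (orthogonal count vectors)
```

This removes the dependence of the kernel value on the raw magnitudes of the count vectors, which otherwise scale with vertex degree.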

### 3.5 Computational complexity

The implementation and the analysis of hypergraphlet kernels is an extension of the available solutions for string kernels [49]. Let $H_n(v)$ be a neighborhood hypergraph, as defined in Section 2.1, and suppose it is significantly smaller than the original hypergraph $H$. The cost of the hypergraphlet counting algorithm is governed by the maximum degree of a vertex; similarly, the generation of the minimum-cost edit path incurs additional work per single hypergraphlet edit operation. Therefore, for each vertex, an order of

$$O\left(\min\{|V(v)|^n, \kappa(n, \Sigma, \Xi)\} \cdot \left(n(|\Sigma| + |\Xi|) + n^2|\Xi|\right) \cdot \tau\right)$$

operations is necessary, where the term $\min\{|V(v)|^n, \kappa(n, \Sigma, \Xi)\}$ bounds the number of possible $n$-hypergraphlets in $H_n(v)$. Note that the possible number of hyperedges in a hypergraph can be significantly larger than the possible number of edges in a standard graph. Hence, in a practical setting, the edit-distance hypergraphlet kernels could greatly benefit from effective sampling techniques or exploitation of special types of hypergraphlets. The proposed implementation computes hypergraphlet kernel functions in time linear in the number of non-zero elements of the count vectors.

## 4 Experiment design

In this section we summarize classification problems, data sets, and evaluation methodology. The hypergraphlet kernels were evaluated on the problems of edge classification and link prediction, both of which require generation of dual hypergraphs followed by the subsequent vertex classification approach.

### 4.1 Data sets

Protein-protein interaction data. The protein-protein interaction (PPI) data was used for both edge classification and link prediction. In the context of edge classification, we are given a PPI network where each interaction is annotated as either direct physical interaction or a co-membership in a complex. The objective is to predict the type of each interacting protein pair as physical vs. complex (PC). For this task, we used the budding yeast S. cerevisiae PPI network assembled by Ben-Hur and Noble [4].

Another important task in PPI networks is discovering whether two proteins interact. Despite the existence of high-throughput experimental methods for determining interactions between proteins, the PPI network data of all organisms is incomplete [59]. Furthermore, high-throughput PPI data contains a potentially large fraction of false positive interactions [59, 19, 33]. Therefore, there is a continued need for computational methods to help guide experiments for identifying novel interactions. Under this scenario, there are two classes of link prediction algorithms: (1) prediction of direct physical interactions [18, 47, 36, 4] and (2) prediction of co-membership in a protein complex [64, 46]. In this paper, we focused on the former task and assembled nine species-specific data sets comprised solely of direct protein-protein interaction data derived from public databases (BIND, BioGRID, DIP, HPRD, and IntAct) as of January 2017. We considered only one protein isoform per gene and used experimental evidence types described by Lewis et al. [33]. Specifically, we constructed link prediction tasks for: (1) bacterium E. coli (EC), (2) budding yeast S. cerevisiae (SC), (3) nematode worm C. elegans (CE), (4) thale cress A. thaliana (AT), (5) fruit fly D. melanogaster (DM), (6) human H. sapiens (HS), (7) fission yeast S. pombe (SP), (8) brown rat R. norvegicus (RN), and (9) house mouse M. musculus (MM).

Drug-target interaction data. Identification of interactions between drugs and target proteins is an area of growing interest in drug design and therapy [63, 61]. In a drug-target interaction (DTI) network, nodes correspond to either drugs or proteins and edges indicate that a protein is a known target of the drug. Here we used DTI data for both edge classification and link prediction. In the context of edge labeling, we are given a DTI network where each interaction is annotated as direct (binding) or indirect, as well as assigned modes of action as activating or inhibiting. The objective is to predict the type of each interaction between proteins and drug compounds. For this task, we derived two data sets: (1) indirect vs. direct (ID) binding derived from MATADOR, and (2) activation vs. inhibition (AI) assembled from STITCH. Under link prediction setting, the learning task is to predict drug-target protein interactions. In particular, we focus on four drug-target classes: (1) enzymes (EZ), (2) ion channels (IC), (3) G protein-coupled receptors (GR), and (4) nuclear receptors (NR); originally assembled by Yamanishi et al. [63]. Table 1 summarizes all data sets used in this work.

### 4.2 Integrating domain knowledge via vertex alphabet

To incorporate domain knowledge into the PPI networks, we exploited the fact that each vertex (protein) in the graph is associated with its amino acid sequence. Two methods were used to develop the vertex alphabet. First, we mapped each protein into a vector of $k$-mer counts and then applied hierarchical clustering on these count vectors; the clustering step assigned one of the $|\Sigma|$ vertex labels to each node. Second, we used protein sequences to predict their molecular and biological function (Gene Ontology terms) using the FANN-GO algorithm [12]. Hierarchical clustering was subsequently used on the predicted term scores to group proteins into broad functional categories. In the case of DTI data, target proteins were annotated in a similar manner. For labeling drug compounds, we used the chemical structure similarity matrix computed with SIMCOMP [20], transformed it into a dissimilarity matrix and then applied hierarchical clustering to group compounds into structural categories.
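The first labeling scheme starts from $k$-mer count vectors; a minimal sketch of that step (hypothetical code; the hierarchical clustering of the resulting vectors is omitted):

```python
from collections import Counter

def kmer_counts(seq, k):
    # Sliding-window k-mer counts over an amino acid sequence.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Toy amino acid sequence; the repeated "VLA" motif yields repeated 2-mers.
counts = kmer_counts("MKVLAVLA", 2)
print(counts["LA"], counts["VL"])  # 2 2
```

Clustering these vectors and taking each protein's cluster index as its label yields the finite vertex alphabet $\Sigma$ required by the kernels.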

### 4.3 Evaluation methodology

For each data set, we evaluated all hypergraphlet kernels by comparing them to two in-house implementations of random walk kernels on hypergraphs. The random walk kernels were implemented as follows: given a hypergraph and two vertices $u$ and $v$, simultaneous random walks were generated from $u$ and $v$ using random restarts. However, in contrast to random walks on standard graphs, a random walk in a hypergraph is a two-step process in which, at each step, one must simultaneously (1) pick hyperedges incident with the current vertices and (2) pick destination vertices within those hyperedges. This process is repeated until a pre-defined number of steps is reached. In the conventional random walk implementation on hypergraphs, a walk was scored as 1 if the entire sequences of vertex and hyperedge labels along the two walks matched; otherwise, a walk was scored as 0. After 10,000 steps, the scores over all walks were summed to produce a kernel value between $u$ and $v$. In order to construct a random walk kernel similar to the hypergraphlet edit-distance approach, a cumulative random walk kernel was also implemented. Here, any match between the labels of the two current vertices, or the two current hyperedges, in the $i$-th step of each walk was scored as 1, while a mismatch was scored as 0. Thus, a walk could contribute anywhere between zero and its full length to the total count. In each of the random walks, the probability of restart was selected from a set of candidate values and the result with the highest accuracy is reported. On the PPI data sets we also evaluated the performance of pairwise spectrum kernels [4]; the $k$-mer size was varied and the result with the highest accuracy is reported. Finally, in the case of the edit-distance kernels, we computed the set of normalized hypergraphlet kernel matrices for all kernel variants and parameter combinations obtained from a grid search over $n$ and $\tau$; the result with the highest accuracy is reported.

The performance of each method was evaluated through 10-fold cross-validation. In each iteration, 10% of the nodes in the network are selected for the test set, whereas the remaining 90% are used for training. Support vector machine (SVM) classifiers were used to construct all predictors and perform comparative evaluation. We used SVMs with the default value for the capacity parameter [28]. Once each predictor was trained, we used Platt's correction to adjust the outputs of the predictor to the 0-1 range [41]. Finally, we estimated the area under the ROC curve (AUC), which plots the true positive rate (sensitivity) as a function of the false positive rate (1 - specificity).
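The AUC used above can be computed directly from classifier scores; a minimal stdlib sketch (hypothetical code) using the rank-sum formulation, i.e., the probability that a randomly chosen positive is ranked above a randomly chosen negative:

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formulation:
    the fraction of positive-negative pairs ranked correctly, with ties
    counted as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    total = 0.0
    for p in pos:
        for n in neg:
            total += 1.0 if p > n else (0.5 if p == n else 0.0)
    return total / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.4, 0.2], [1, 1, 0, 0]))  # 1.0  (perfect ranking)
print(auc([0.2, 0.8, 0.4, 0.9], [1, 1, 0, 0]))  # 0.25 (mostly inverted)
```

Because AUC depends only on the ranking of scores, it is well suited to the positive-unlabeled setting discussed in Section 2.4, where calibrated probabilities are not directly available.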

## 5 Results

### 5.1 Performance analysis on edge classification

We first evaluated the performance of hypergraphlet kernels on the task of predicting the types of interactions between pairs of proteins in a PPI network, as well as interaction types and modes of action between proteins and chemicals in DTI data. As described in Section 3.1, we first converted the input hypergraph to its dual hypergraph and then used the dual hypergraph for vertex classification. Table 2 lists the AUC estimates for each method and data set. Figure 3 shows ROC curves for one representative data set from each classification task and network type. Observe that the edit-distance kernels outperformed the traditional hypergraphlet kernels on all data sets, and achieved higher AUCs than the random walk kernels on two of the three data sets. These results therefore provide evidence of the feasibility of this alternative approach to edge classification via exploiting hypergraph duality.

### 5.2 Performance analysis on link prediction

The performance of hypergraphlet kernels was further evaluated on the problem of link prediction on multiple PPI and DTI network data sets. Tables 3 and 4 show the accuracies of each hypergraph-based method across all link prediction data sets. These results demonstrate good performance of our methods, with edit-distance kernels generally performing best. The primary objective of our study was to present a new approach whose value will increase as biological data becomes more frequently modeled by hypergraphs; at this time, such data sets are not readily available.

### 5.3 Estimating interactome sizes

We used the AlphaMax algorithm [24] for estimating class priors in positive-unlabeled learning to estimate the number of missing links and misannotated (false positive) interactions in each PPI network. For example, if we assume a tissue- and cellular-component-agnostic model (i.e., any two proteins can interact), we obtained that the number of missing interactions in the largest component of the human PPI network (see Table 1) is about 5% (i.e., approximately 2.5 million interactions), while the number of misannotated interactions is close to 11%, which translates to about 4,985 interactions. In the case of yeast, we computed that less than 1% of the potential protein interactions are missing, which is close to 95,000; the number of misannotated interactions is close to 13%, or about 3,400 misannotated protein pairs. Some of these numbers are consistent with previous studies suggesting that the size of the yeast interactome is between 13,500 [53] and 137,000 [22], while the size of the human interactome is estimated to be between 130,000 [57] and 650,000 [53] interactions. A recent paper by Lewis et al. [33] presents a scenario where the yeast and human interactome sizes could reach 400,000 and over two million interactions, respectively. In any case, we note that these estimates were made as a proof of concept for the proposed methodology under the assumption of representative positive data. They can, however, serve as further validation of the usefulness of our problem formulation and underlying methodology. Additional tests and experiments, potentially involving exhaustive classifier and parameter optimization, will be necessary for more accurate and reliable estimates, especially for understanding the influence of potential biases within the PPI network data.

## 6 Related work

The literature on similarity-based measures for learning on hypergraphs is relatively scarce. Most studies revolve around the use of random walks for clustering, first used in the field of circuit design [13]. Historically, typical hypergraph-based learning approaches can be divided into (1) tensor-based approaches, which extend traditional matrix (spectral) methods on graphs to higher-order relations for hypergraph clustering [13, 10, 32], and (2) approximation-based approaches, which convert hypergraphs into standard weighted graphs and then exploit conventional graph clustering and (semi-)supervised learning [2, 65]. The methods from the first category provide a direct and mathematically rigorous treatment of hypergraph learning, although most tensor problems are NP-hard. As a consequence, this line of research remains largely unexplored despite a renewed interest in tensor decomposition approaches [21, 45]. Regarding the second category, there are two commonly used transformations for graph-based hypergraph approximation: (1) the star expansion and (2) the clique expansion. These methods are reviewed and compared by Agarwal et al. [1].

Under a supervised learning framework, Wachman and Khardon [60] propose random walk-based hypergraph kernels on ordered hypergraphs, while Sun et al. [54] present a hypergraph spectral learning formulation for multi-label classification. More recently, Bai et al. [3] introduced a hypergraph kernel that transforms a hypergraph into a directed line graph and computes a Weisfeiler-Lehman isomorphism test between directed graphs. A major drawback of most such approaches is that no graph representation fully captures the hypergraph structure. For instance, Ihler et al. [23] have shown that it is impossible to have an exact representation of a hypergraph via a graph while still retaining its cut properties. Therefore, there is a need for a robust hypergraph-based methodology for learning directly on hypergraph data.

## 7 Conclusions

This paper presents a learning framework for the problems of vertex classification, (hyper)edge classification, and link prediction in graphs and hypergraphs. The key to our approach is the use of hypergraph duality in order to cast each classification problem as an instance of vertex classification. This work also presents a new family of kernel functions defined directly on hypergraphs. Using the terminology of Bleakley et al. [7], our method belongs to the category of “local” techniques. That is, it captures the structure of local neighborhoods, rooted at the vertex of interest, and should be distinguished from “global” models such as Markov Random Fields or diffusion kernels [31]. The body of literature on graph learning is vast. We therefore chose to perform extensive comparisons against a limited set of methods that are most relevant to ours.
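The duality underlying this formulation can be illustrated with a minimal sketch, assuming a dict-of-sets hypergraph encoding of our own choosing: hyperedges of the original hypergraph become vertices of the dual, so hyperedge classification on the original reduces to vertex classification on the dual.

```python
# Minimal sketch of hypergraph duality. Each original hyperedge becomes a
# dual vertex; each original vertex v becomes a dual hyperedge containing
# the hyperedges incident to v.

def dual_hypergraph(hyperedges):
    dual = {}
    for name, members in hyperedges.items():
        for v in members:
            dual.setdefault(v, set()).add(name)
    return dual

H = {'e1': {'a', 'b'}, 'e2': {'b', 'c'}, 'e3': {'c'}}
D = dual_hypergraph(H)
print(sorted(D['b']))  # prints: ['e1', 'e2'] -- dual hyperedge for vertex b
```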

The development of hypergraphlet kernels derives from the graph reconstruction conjecture, an idea of using small graphs to probe large graphs [8, 9]. Hypergraphlet kernels prioritize accuracy over run time and, it may be argued, do not follow some recent trends in machine learning that generally trade off accuracy for improved scalability and real-time performance. We therefore propose that hypergraphlet kernel approaches, in particular those based on edit distances, be predominantly used on sparse graphs of moderate size. Fortunately, all graphs used in this work fall into that category. Increased accuracy, in general, benefits experimental biologists who typically use prediction to prioritize targets for experimental validation.

The proposed methodology was evaluated on multiple data sets for edge classification and link prediction in biological networks. The results show that hypergraphlet kernels are competitive with other approaches and readily deployable in practice. Through limited tests, we also find that combining hypergraphlet kernels with pairwise spectrum kernels achieves better accuracy than either method does individually.
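The combination scheme is not detailed above; one standard construction, shown here purely as an assumption, is a convex combination of cosine-normalized kernel matrices, which is itself a valid kernel.

```python
# Hypothetical kernel combination: convex sum of cosine-normalized kernels.
# The matrices below are illustrative, not from the paper's experiments.

import numpy as np

def normalize_kernel(K):
    # Cosine normalization: K'[i,j] = K[i,j] / sqrt(K[i,i] * K[j,j])
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def combine_kernels(K1, K2, w=0.5):
    return w * normalize_kernel(K1) + (1 - w) * normalize_kernel(K2)

K_hyper = np.array([[4.0, 2.0], [2.0, 9.0]])  # e.g., hypergraphlet kernel
K_spec = np.array([[1.0, 0.5], [0.5, 1.0]])   # e.g., pairwise spectrum kernel
K = combine_kernels(K_hyper, K_spec)
print(np.allclose(np.diag(K), 1.0))  # prints: True
```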

## 8 Acknowledgments

We thank Matthew Carey for his help in implementing hyperedge-indel kernels. This work was partially supported by the National Science Foundation (NSF) grant DBI-1458477, National Institutes of Health (NIH) grant R01 MH105524, and the Indiana University Precision Health Initiative.

## Appendix

### .1 Enumeration of labeled hypergraphlets

Here we characterize the feature space of fully labeled hypergraphlets by describing the dimensionality κ(n, Σ, Ξ) of the count vectors. We are interested in the order of growth of κ(n, Σ, Ξ) as a function of n, Σ and Ξ.

Suppose that g1 and g2 are base hypergraphlets with n vertices and m hyperedges. We say that g1 and g2 belong to the same equivalence class if and only if the total numbers of (non-isomorphic) fully labeled hypergraphlets corresponding to the base cases g1 and g2 are equal for any Σ and Ξ. The total counts of labeled hypergraphlets over all alphabet sizes induce a partition of base hypergraphlets into equivalence classes. We denote the set of all equivalence classes over the hypergraphlets of order n as S(n). For example, the set of vertex- and hyperedge-labeled hypergraphlets of a given order can be partitioned into either two symmetry classes, when only vertex labels are used, or seven symmetry classes, when hyperedges are labeled as well. Table 5 summarizes the equivalence classes induced by partitioning base hypergraphlets up to a fixed order, along with the cardinality of each set. Overall, observe that the cardinality of S(n) can be significantly larger than those reported for graphlets [35] because the possible number of hyperedges in a hypergraphlet is generally much larger than the possible number of edges in a graphlet. Additionally, hyperedge labels require the base hypergraphlets g1 and g2 to have an equal number of hyperedges.

This approach can be generalized to hypergraphlets labeled by any alphabets Σ and Ξ, such that

 κ(n, Σ, Ξ) = ∑_{i=1}^{|S(n)|} m_i(n, Σ, Ξ) · |S_i(n)|,

where m_i(n, Σ, Ξ) is the number of (non-isomorphic) fully labeled hypergraphlets corresponding to any base hypergraphlet from the equivalence class S_i(n). We use this decomposition to compute the total dimensionality of the count vectors by first finding the equivalence classes corresponding to the base hypergraphlets and then counting the number of labeled hypergraphlets for any one member of each group.

In the case of undirected fully labeled hypergraphlets, m_i(n, Σ, Ξ) can also be computed by applying the theory of enumeration developed by Pólya [42]. In order to derive the complete generating function for each equivalence class S_i(n), we first define the automorphism group A of a given vertex- and hyperedge-labeled hypergraph g. That is, in the case of fully labeled hypergraphs, the set A is a collection of permutations (automorphisms) of the vertices and hyperedges of g. Therefore, the counting problem can be re-formulated as follows: let g be a base hypergraphlet with n vertices and m hyperedges, and let A be the automorphism group of g over its vertices and hyperedges. Then, each permutation α ∈ A can be written uniquely as a product of disjoint cycles such that, for each integer k (1 ≤ k ≤ n) and k′ (1 ≤ k′ ≤ m), we define j_k(α) (j_{k′}(α)) as the number of cycles of length k (k′) in the disjoint cycle expansion of α over the vertices (hyperedges). The generalized formula for the cycle index of A, denoted Z(A), is a polynomial in s_1, …, s_n, s′_1, …, s′_m given by

 Z(A; s_1, …, s_n; s′_1, …, s′_m) = (1/|A|) ∑_{α∈A} ∏_{k=1}^{n} ∏_{k′=1}^{m} s_k^{j_k(α)} · s′_{k′}^{j_{k′}(α)}.

By applying Pólya’s theorem in the context of enumerating vertex- and hyperedge-labeled hypergraphlets corresponding to any base hypergraphlet in S_i(n), we get that m_i(n, Σ, Ξ) is determined by substituting |Σ| for each variable s_k and |Ξ| for each variable s′_{k′} in Z(A). Hence,

 m_i(n, Σ, Ξ) = Z(A; |Σ|, |Σ|, …, |Σ|; |Ξ|, |Ξ|, …, |Ξ|),

where A is the automorphism group of a base hypergraphlet from S_i(n). As an example, consider one of the equivalence classes above, whose representative base hypergraphlet is illustrated in unlabeled form in Figure 2. Once its automorphism group A is enumerated, the cycle index Z(A) follows directly, and the substitution above yields the count m_i(n, Σ, Ξ).
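The Pólya substitution can be made concrete with a small sketch. For simplicity we use a plain graph rather than a hypergraph (the hypergraph case adds a second family of variables for hyperedge cycles): the path a–b–c has an automorphism group containing the identity and one reflection, giving Z(A) = (s_1^3 + s_1 s_2)/2 and hence (k^3 + k^2)/2 non-isomorphic vertex labelings over an alphabet of size k.

```python
# Brute-force Polya counting for vertex labelings of a small graph.
# By Polya/Burnside, the count is the average over automorphisms of
# k ** (number of cycles in the automorphism's vertex permutation).

from itertools import permutations

def automorphisms(vertices, edges):
    edge_set = {frozenset(e) for e in edges}
    for perm in permutations(vertices):
        mapping = dict(zip(vertices, perm))
        if {frozenset((mapping[u], mapping[v])) for u, v in edges} == edge_set:
            yield mapping

def num_cycles(mapping):
    seen, cycles = set(), 0
    for start in mapping:
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:
            seen.add(v)
            v = mapping[v]
    return cycles

def count_labelings(vertices, edges, k):
    total, group_size = 0, 0
    for a in automorphisms(vertices, edges):
        group_size += 1
        total += k ** num_cycles(a)
    return total // group_size  # always an integer by Burnside's lemma

print(count_labelings('abc', [('a', 'b'), ('b', 'c')], 2))  # prints: 6
```

With k = 2 the formula gives (8 + 4)/2 = 6, matching the brute-force count.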

### References

1. S. Agarwal, K. Branson, and S. Belongie. Higher order learning with graphs. In Proc. 23rd International Conference on Machine Learning, ICML ’06, pp. 17–24, 2006.
2. S. Agarwal, J. Lim, L. Zelnik-Manor, P. Perona, D. Kriegman, and S. Belongie. Beyond pairwise clustering. In Proc. 18th Conference on Computer Vision and Pattern Recognition, CVPR ’05, pp. 838–845, 2005.
3. L. Bai, P. Ren, and E. R. Hancock. A hypergraph kernel from isomorphism tests. In Proc. 22nd International Conference on Pattern Recognition, ICPR ’14, pp. 3880–3885, 2014.
4. A. Ben-Hur and W. S. Noble. Kernel methods for predicting protein-protein interactions. Bioinformatics, 21(Suppl 1):i38–i46, 2005.
5. C. Berge. Graphs and Hypergraphs. North-Holland, 1973.
6. G. Blanchard, G. Lee, and C. Scott. Semi-supervised novelty detection. J Mach Learn Res, 11:2973–3009, 2010.
7. K. Bleakley, G. Biau, and J-P. Vert. Supervised reconstruction of biological networks with local models. Bioinformatics, 23(13):i57, 2007.
8. J. A. Bondy and R. L. Hemminger. Graph reconstruction-a survey. J Graph Theory, 1(3):227–268, 1977.
9. C. Borgs, J. Chayes, L. Lovász, V. T. Sós, and K. Vesztergombi. Counting graph homomorphisms. In Topics Discrete Math, Algorithms and Combinatorics, pp. 315–371. Springer Berlin Heidelberg, 2006.
10. S. R. Bulò and M. Pelillo. A game-theoretic approach to hypergraph clustering. In Proc. 22nd Advances in Neural Information Processing Systems, NIPS ’09, pp. 1571–1579, 2009.
11. F. Chung-Graham. Spectral Graph Theory. CBMS Regional Conference Series in Mathematics, 1997.
12. W. T. Clark and P. Radivojac. Analysis of protein function and its prediction from amino acid sequence. Proteins, 79(7):2086–2096, 2011.
13. J. Cong, L. Hagen, and A. Kahng. Random walks for circuit clustering. In Proc. 4th International ASIC Conference, ASIC ’91, pp. P14–2.1–P14–2.4, 1991.
14. F. Denis, R. Gilleron, and F. Letouzey. Learning from positive and unlabeled examples. Theor Comput Sci, 348(1):70–83, 2005.
15. C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data. In Proc. 14th International Conference on Knowledge Discovery and Data Mining, KDD ’08, pp. 213–220, 2008.
16. T. Fawcett. An introduction to ROC analysis. Pattern Recogn Lett, 27:861–874, 2006.
17. S. Fortunato. Community detection in graphs. Phys Rep, 486(3-5):75–174, 2010.
18. S. M. Gomez, W. S. Noble, and A. Rzhetsky. Learning to predict protein-protein interactions from protein sequences. Bioinformatics, 19(15):1875–1881, 2003.
19. G. T. Hart, A. K. Ramani, and E. M. Marcotte. How complete are current yeast and human protein-interaction networks? Genome Biol, 7(11):120, 2006.
20. M. Hattori, Y. Okuno, S. Goto, and M. Kanehisa. Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways. JACS, 125(39):11853–11865, 2003.
21. M. Hein, S. Setzer, L. Jost, and S. S. Rangapuram. The total variation on hypergraphs - learning on hypergraphs revisited. In Proc. 26th Advances in Neural Information Processing Systems, NIPS ’13, pp. 2427–2435, 2013.
22. H. Huang, B. M Jedynak, and J. S. Bader. Where have all the interactions gone? Estimating the coverage of two-hybrid protein interaction maps. PLoS Comput Biol, 3(11):1–20, 2007.
23. E. Ihler, D. Wagner, and F. Wagner. Modeling hypergraphs by graphs with the same mincut properties. Inform Process Lett, 45(4):171–175, 1993.
24. S. Jain, M. White, and P. Radivojac. Estimating the class prior and posterior from noisy positives and unlabeled data. In Proc. 30th Advances in Neural Information Processing Systems, NIPS ’16, pp. 2693–2701, 2016.
25. S. Jain, M. White, and P. Radivojac. Recovering true classifier performance in positive-unlabeled learning. In Proc. 31st AAAI Conference on Artificial Intelligence, AAAI ’17, 2017.
26. S. Jain, M. White, M. W. Trosset, and P. Radivojac. Nonparametric semi-supervised learning of class proportions. arXiv preprint arXiv:1601.01944, 2016.
27. C. Jiang, F. Coenen, and M. Zito. A survey of frequent subgraph mining algorithms. Knowl Eng Rev, 28(01):75–105, 2013.
28. T. Joachims. Learning to classify text using support vector machines: methods, theory, and algorithms. Kluwer Academic Publishers, 2002.
29. S. Klamt, U-U. Haus, and F. Theis. Hypergraphs and cellular networks. PLoS Comput Biol, 5(5):1–6, 2009.
30. D. Koller and N. Friedman. Probabilistic graphical models: Principles and Techniques. MIT Press, 2009.
31. R. I. Kondor and J. D. Lafferty. Diffusion kernels on graphs and other discrete structures. In Proc. 19th International Conference on Machine Learning, ICML ’02, pp. 315–322, 2002.
32. M. Leordeanu and C. Sminchisescu. Efficient hypergraph clustering. In Proc. 15th International Conference on Artificial Intelligence and Statistics, volume 22 of AISTATS ’12, pp. 676–684, 2012.
33. A. C. F. Lewis, N. S. Jones, M. A. Porter, and C. M. Deane. What evidence is there for the homology of protein-protein interactions? PLoS Comput Biol, 8:1–14, 9 2012.
34. D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. J Am Soc Inf Sci Technol, 58(7):1019–1031, 2007.
35. J. Lugo-Martinez and P. Radivojac. Generalized graphlet kernels for probabilistic inference in sparse graphs. Network Science, 2(2):254–276, 2014.
36. S. Martin, D. Roe, and J-L. Faulon. Predicting protein-protein interactions using signature products. Bioinformatics, 21(2):218–226, 2005.
37. A. K. Menon, B. van Rooyen, C. S. Ong, and R. C. Williamson. Learning from corrupted binary labels via class-probability estimation. In Proc. 32nd International Conference on Machine Learning, ICML ’15, pp. 125–134, 2015.
38. R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: simple building blocks of complex networks. Science, 298:824–827, 2002.
39. Y. Moreau and L. C. Tranchevent. Computational tools for prioritizing candidate genes: boosting disease gene discovery. Nat Rev Genet, 13(8):523–536, 2012.
40. E. Nabieva, K. Jim, A. Agarwal, B. Chazelle, and M. Singh. Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics, 21(Suppl 1):i302–i310, 2005.
41. J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, MIT Press, pp. 61–74, 2000.
42. G. Pólya. Kombinatorische anzahlbestimmungen für gruppen, graphen und chemische verbindungen. Acta Math, 68:145–254, 1937.
43. N. Przulj. Biological network comparison using graphlet degree distribution. Bioinformatics, 23(2):e177–e183, 2007.
44. N. Przulj, D. G. Corneil, and I. Jurisica. Modeling interactome: scale-free or geometric? Bioinformatics, 20(18):3508–3515, 2004.
45. P. Purkait, T-J. Chin, H. Ackermann, and D. Suter. Clustering with hypergraphs: the case for large hyperedges. In Proc. 13th European Conference on Computer Vision, ECCV ’14, pp. 672–687, 2014.
46. J. Qiu and W. S. Noble. Predicting co-complexed protein pairs from heterogeneous data. PLoS Comput Biol, 4(4):e1000054, 2008.
47. A. K. Ramani and E. M. Marcotte. Exploiting the co-evolution of interacting proteins to discover interaction specificity. J Mol Biol, 327(1):273–284, 2003.
48. H. G. Ramaswamy, C. Scott, and A. Tewari. Mixture proportion estimation via kernel embedding of distributions. arXiv preprint arXiv:1603.02501, 2016.
49. K. Rieck and P. Laskov. Linear-time computation of similarity measures for sequential data. J Mach Learn Res, 9:23–48, 2008.
50. R. Sharan, I. Ulitsky, and R. Shamir. Network-based prediction of protein function. Mol Syst Biol, 3:88, 2007.
51. J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge University Press, 2004.
52. N. Shervashidze, S. V. N. Vishwanathan, T. H. Petri, K. Mehlhorn, and K. M. Borgwardt. Efficient graphlet kernels for large graph comparison. In Proc. 12th International Conference on Artificial Intelligence and Statistics, AISTATS ’09, pp. 488–495, 2009.
53. M. P. H. Stumpf, T. Thorne, E. de Silva, R. Stewart, H. J. An, M. Lappe, and C. Wiuf. Estimating the size of the human interactome. Proc Natl Acad Sci USA, 105(19):6959–6964, 2008.
54. L. Sun, S. Ji, and J. Ye. Hypergraph spectral learning for multi-label classification. In Proc. 14th International Conference on Knowledge Discovery and Data Mining, KDD ’08, pp. 668–676, 2008.
55. K. Tsuda and H. Saigo. Graph classification. In Managing and Mining Graph Data, volume 40 of Advances in Database Systems, pp. 337–363, 2010.
56. V. Vacic, L. M. Iakoucheva, S. Lonardi, and P. Radivojac. Graphlet kernels for prediction of functional residues in protein structures. J Comput Biol, 17(1):55–72, 2010.
57. K. Venkatesan et al. An empirical framework for binary interactome mapping. Nat Methods, 6:83–90, 2009.
58. S. V. N. Vishwanathan, N. N. Schraudolph, R. I. Kondor, and K. M. Borgwardt. Graph kernels. J Mach Learn Res, 11:1201–1242, 2010.
59. C. von Mering, R. Krause, R. I. Kondor, B. Snel, M. Cornell, S. G. Oliver, S. Fields, and P. Bork. Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 417(6887):399–403, 2002.
60. G. Wachman and R. Khardon. Learning from interpretations: a rooted kernel for ordered hypergraphs. In Proc. 24th International Conference on Machine Learning, ICML ’07, pp. 943–950, 2007.
61. Y. Wang and J. Zeng. Predicting drug-target interactions using restricted boltzmann machines. Bioinformatics, 29(13):i126, 2013.
62. J. Xu and Y. Li. Discovering disease-genes by topological features in human protein-protein interaction network. Bioinformatics, 22(22):2800–2805, 2006.
63. Y. Yamanishi, M. Araki, A. Gutteridge, W. Honda, and M. Kanehisa. Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics, 24(13):i232–i240, 2008.
64. L. V. Zhang, S. L. Wong, O. D. King, and F. P. Roth. Predicting co-complexed protein pairs using genomic and proteomic data integration. BMC bioinformatics, 5(1):38, 2004.
65. D. Zhou, J. Huang, and B. Schölkopf. Learning with hypergraphs: Clustering, classification, and embedding. In Proc. 19th Advances in Neural Information Processing Systems, NIPS ’06, pp. 1601–1608, 2006.
66. X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. In Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.