Learning over Families of Sets - Hypergraph Representation Learning for Higher Order Tasks
Abstract
Graph representation learning has made major strides over the past decade. However, in many relational domains, the input data are not suited for simple graph representations as the relationships between entities go beyond pairwise interactions. In such cases, the relationships in the data are better represented as hyperedges (sets of entities) of a non-uniform hypergraph. While there have been works on principled methods for learning representations of nodes of a hypergraph, these approaches are limited in their applicability to tasks on non-uniform hypergraphs (hypergraphs whose hyperedges have different cardinalities). In this work, we exploit the incidence structure to develop a hypergraph neural network that learns provably expressive representations of variable-sized hyperedges. These representations preserve local isomorphism in the line graph of the hypergraph while also being invariant to permutations of the constituent vertices. Specifically, for a given vertex set, we propose frameworks for (1) hyperedge classification and (2) variable-sized expansion of partially observed hyperedges, which capture the higher-order interactions among vertices and hyperedges. We evaluate performance on multiple real-world hypergraph datasets and demonstrate consistent, significant improvement in accuracy over state-of-the-art models.
1 Introduction
Deep learning on graphs has been a rapidly evolving field due to its widespread applications in domains such as e-commerce, personalization, fraud and abuse, life sciences, and social network analysis. However, graphs can only capture interactions involving pairs of entities, whereas in many of the aforementioned domains any number of entities can participate in a single interaction. For example, more than two substances can interact at a specific instance to form a new compound, study groups can contain more than two students, recipes contain multiple ingredients, and shoppers purchase multiple items together. Graphs can therefore be an oversimplified depiction of the input data, which may result in a significant loss of information. Hypergraphs [7] (see Figure 1(a) for an example), which serve as the natural extension of dyadic graphs, form the obvious solution.
Due to the ubiquitous nature of hypergraphs, learning on hypergraphs has been studied for more than a decade [1, 27]. Early works on learning on hypergraphs employed random walk procedures [15, 5, 8], and the vast majority of them were limited to hypergraphs whose hyperedges all have the same cardinality (uniform hypergraphs). More recently, with the growing popularity and success of message passing graph neural networks [14, 12], message passing hypergraph neural network frameworks have been proposed [10, 4, 23, 26, 24]. These works rely on constructing the clique expansion (Figure 1(c)), the star expansion (Figure 1(d)), or other expansions of the hypergraph that preserve partial information. Subsequently, node representations are learned using GNNs on the graph constructed as a proxy of the hypergraph. These strategies are insufficient because (1) there is no bijective transformation between a hypergraph and the constructed clique expansion (loss of information); (2) they do not accurately model the underlying dependency between a hyperedge and its constituent vertices (for example, a hyperedge may cease to exist if one of its nodes were deleted); or (3) they do not directly model the interactions between different hyperedges. The primary goal of this work is to address these issues and to build models which better represent hypergraphs.
Corresponding to the adjacency matrix representation of the edge set of a graph, a hypergraph is commonly represented as an incidence matrix (Figure 1(b)), in which a row is a vertex, a column is a hyperedge, and an entry is 1 if the vertex belongs to the hyperedge and 0 otherwise. In this work, we directly seek to exploit the incidence structure of the hypergraph to learn representations of nodes and hyperedges. Specifically, for a given partially observed hypergraph, we synchronously learn vertex and hyperedge representations that take into consideration both the line graph (Figure 1(e)) and the set of hyperedges that a vertex belongs to, in order to learn provably expressive representations. The jointly learned vertex and hyperedge representations are then used to tackle higher-order tasks such as expansion of partially observed hyperedges and classification of unobserved hyperedges.
While the task of hyperedge classification has been studied before, set expansion for relational data has largely been unexplored. For example, given a partial set of substances which are constituents of a single drug, hyperedge expansion entails completing the set of all constituents of the drug while having access to the composition of multiple other drugs. A more detailed example for each of these tasks is presented in the Appendix (Section 7.1). For the hyperedge expansion task, we propose a GAN framework [11] to learn a probability distribution over the vertex power set (conditioned on a partially observed hyperedge), which maximizes the pointwise mutual information between a partially observed hyperedge and other disjoint vertex subsets in the vertex power set.
Our contributions can be summarized as follows: (1) We propose a hypergraph neural network which exploits the incidence structure and hence works on real-world sparse hypergraphs which have hyperedges of different cardinalities. (2) We provide provably expressive representations of vertices and hyperedges, as well as of the complete hypergraph, which preserve properties of hypergraph isomorphism. (3) We introduce a new task on hypergraphs, namely variable-sized hyperedge expansion, and also perform variable-sized hyperedge classification. Furthermore, we demonstrate improved performance over existing baselines on the majority of the hypergraph datasets using our proposed model.
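The constructions referred to above can be sketched in a few lines of plain Python. The snippet is only an illustration: the function names and the toy hypergraphs are assumptions made for this example and are not taken from the paper. It builds the incidence matrix, the clique expansion and the line graph from a list of hyperedges, and then shows two different hypergraphs that collapse to the same clique expansion.

```python
from itertools import combinations

def incidence_matrix(num_vertices, hyperedges):
    """Rows are vertices, columns are hyperedges; entry is 1 iff the vertex is in the hyperedge."""
    return [[1 if v in e else 0 for e in hyperedges] for v in range(num_vertices)]

def clique_expansion(hyperedges):
    """Vertex pairs that co-occur in at least one hyperedge."""
    pairs = set()
    for e in hyperedges:
        pairs.update(combinations(sorted(e), 2))
    return pairs

def line_graph(hyperedges):
    """Pairs of hyperedge indices whose hyperedges share at least one vertex."""
    return {(i, j) for i, j in combinations(range(len(hyperedges)), 2)
            if hyperedges[i] & hyperedges[j]}

# Toy hypergraph (hypothetical): 6 vertices, 3 hyperedges of different sizes.
hyperedges = [frozenset({0, 1, 2}), frozenset({2, 3}), frozenset({3, 4, 5})]
H = incidence_matrix(6, hyperedges)

# One 3-vertex hyperedge and three 2-vertex hyperedges give the same clique expansion.
triple = [frozenset({0, 1, 2})]
pairs = [frozenset({0, 1}), frozenset({0, 2}), frozenset({1, 2})]
same_expansion = clique_expansion(triple) == clique_expansion(pairs)
```

Here `same_expansion` is true although the two incidence matrices differ, which is the loss of information attributed to clique expansions in the text.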
2 Preliminaries
In our notation henceforth, we shall use capital case characters (e.g., ) to denote a set or a hypergraph, bold capital case characters (e.g., ) to denote a matrix, and capital characters with a right arrow over it (e.g., ) to denote a sequence with a predefined ordering of its elements. We shall use lower characters (e.g., ) to denote the element of a set and bold lower case characters (e.g., ) to denote vectors. Moreover, we shall denote the th row of a matrix with , the th column of the matrix with , and use to denote a subset of the set of size i.e., .
(Hypergraph) Let denote a hypergraph with a finite vertex set , corresponding vertex features , a finite hyperedge set , where and , where denotes the power set on the vertices, the corresponding hyperedge features . We use (termed star of a vertex) to denote the hyperedges incident on a vertex and use , a set of tuples, to denote the family of stars where called the family of stars of H. When explicit vertex and hyperedge features and weights are unavailable, we will consider , , where represents a or vector of ones respectively. The vertex and edge set of a hypergraph can equivalently be represented with an incidence matrix , where if and otherwise. Isomorphic hypergraphs either have the same incidence matrix or a row/column/row and column permutation of the incidence matrix i.e., the matrix is separately exchangeable. We use to denote the line graph (Figure 1(e)) of the hypergraph, use to denote the dual of a hypergraph. Additionally, we define a function , a multivalued function termed line hypergraph of a hypergraph  which generalizes the concepts line graph and the dual of a hypergraph and defines the spectrum of values which lies between them. For the scope of this work, we limit ourselves for to be a dual valued function  using only the two extremes, such that and .
Also, we use to denote the set of all possible attributed hypergraphs with nodes and hyperedges. More formally, is the set of all tuples — for vertex node set size and hyperedge set size .
(1-Weisfeiler-Leman (1-WL) Algorithm) Let be a graph, with a finite vertex set and let be a node coloring function with an arbitrary codomain and denote the color of a node in the graph. Correspondingly, we say a labeled graph is a graph with a complete node coloring . The 1-WL algorithm [3] can then be described as follows: let be a labeled graph and in each iteration, the 1-WL algorithm computes a node coloring which depends on the coloring from the previous iteration. The coloring of a node in iteration is then computed as where HASH is a bijective map between the vertex set and , and denotes the 1-hop neighborhood of node in the graph. The 1-WL algorithm terminates if the number of colors across two iterations does not change, i.e., the cardinalities of the images of and are equal. The 1-WL isomorphism test is an isomorphism test in which the 1-WL algorithm is run in parallel on two graphs, and the two graphs are deemed non-isomorphic if a different number of nodes are colored as in .
(Graph Neural Networks (GNNs)) For a graph , modern GNNs use the edge connectivity and node features to learn a representation vector of a node, , or the entire graph, . They employ a neighborhood aggregation strategy, where the representation of a node is iteratively updated by an aggregation of the representations of its neighbors. Multiple layers are employed to capture the k-hop neighborhood of a node. The update equation of a GNN layer can be written as
where is the representation of node after k layers and is the 1-hop neighborhood of in the graph. [17, 22] showed that message passing GNNs are no more powerful than the 1-WL algorithm.
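The color refinement loop can be sketched as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: graphs are given as adjacency dictionaries, every node starts with the same color, and HASH is replaced by relabeling each distinct (own color, sorted neighbor colors) signature with a small integer. To compare two graphs, the sketch refines their disjoint union so that both share one palette, and then compares color histograms. All names are hypothetical.

```python
from collections import Counter

def wl_refine(adj, colors):
    """adj: dict node -> set of neighbors; colors: dict node -> integer color."""
    for _ in range(len(adj)):
        # Signature of a node: its own color plus the sorted colors of its neighbors.
        signatures = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v]))) for v in adj}
        # Relabel distinct signatures with small integers (stand-in for HASH).
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        refined = {v: palette[signatures[v]] for v in adj}
        if len(set(refined.values())) == len(set(colors.values())):
            return refined  # number of colors unchanged: stop
        colors = refined
    return colors

def wl_same_histogram(adj1, adj2):
    """False means the two graphs are certainly non-isomorphic; True is inconclusive."""
    union = {("a", v): {("a", u) for u in nbrs} for v, nbrs in adj1.items()}
    union.update({("b", v): {("b", u) for u in nbrs} for v, nbrs in adj2.items()})
    colors = wl_refine(union, {v: 0 for v in union})
    hist_a = Counter(c for (g, _), c in colors.items() if g == "a")
    hist_b = Counter(c for (g, _), c in colors.items() if g == "b")
    return hist_a == hist_b

def cycle(nodes):
    n = len(nodes)
    return {nodes[i]: {nodes[(i - 1) % n], nodes[(i + 1) % n]} for i in range(n)}

path = {0: {1}, 1: {0, 2}, 2: {1}}
triangle = cycle([0, 1, 2])
hexagon = cycle([0, 1, 2, 3, 4, 5])
two_triangles = {**cycle([0, 1, 2]), **cycle([3, 4, 5])}
```

The path and the triangle are separated after the first refinement step, whereas the hexagon and the pair of triangles are both 2-regular and therefore receive identical color histograms even though they are not isomorphic.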
(Finite Symmetric Group $S_m$) A finite symmetric group $S_m$ is a discrete group defined over a finite set of $m$ symbols (w.l.o.g. we shall consider the set $\{1, 2, \ldots, m\}$) and consists of all the permutation operations that can be performed on the $m$ symbols. Since the total number of such permutation operations is $m!$, the order of $S_m$ is $m!$.
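(Group Action (left action)) If $X$ is a set and $G$ is a group, then $X$ is a $G$-set if there is a function $G \times X \to X$, denoted by $(g, x) \mapsto g \cdot x$, such that:

$e \cdot x = x$ for all $x \in X$, where $e$ is the identity element of the group $G$

$g \cdot (h \cdot x) = (gh) \cdot x$ for all $g, h \in G$ and $x \in X$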
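The two conditions of a left action can be checked exhaustively for a small symmetric group. The sketch below assumes a particular convention that is not stated in the text: a permutation is stored as a tuple `perm` in which `perm[i]` is the new position of the element at position `i`, and the group acts on sequences by moving their entries. The names `act` and `compose` are hypothetical.

```python
from itertools import permutations
from math import factorial

def act(perm, seq):
    """Left action: the element at position i moves to position perm[i]."""
    out = [None] * len(seq)
    for i, item in enumerate(seq):
        out[perm[i]] = item
    return tuple(out)

def compose(p, q):
    """Group product: (p o q)[i] = p[q[i]], i.e. apply q first, then p."""
    return tuple(p[q[i]] for i in range(len(q)))

m = 3
group = list(permutations(range(m)))  # every element of the symmetric group on m symbols
identity = tuple(range(m))
x = ("a", "b", "c")

# Condition 1: the identity element leaves x unchanged.
identity_ok = act(identity, x) == x
# Condition 2: acting with q and then p equals acting once with the product p o q.
compat_ok = all(act(p, act(q, x)) == act(compose(p, q), x) for p in group for q in group)
```

The list `group` has 3! = 6 elements, matching the order of the symmetric group on three symbols.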
(Orbit) Given a group acting on a set , the orbit of an element is the set of elements in to which can be moved by the elements of . The orbit of is denoted by .
(Vertex Permutation action ) A vertex permutation action is the application of a left action with the element on a sorted sequence of vertices represented as of a hypergraph to output a corresponding permuted sequence of vertices i.e., . A permutation action can also act on any vector, matrix, or tensor defined over the nodes , e.g., , and output an equivalent vector, matrix, or tensor with the order of the nodes permuted e.g., .
(Hyperedge Permutation Action ) A hyperedge permutation action is the application of a left action with the element on a sorted sequence of hyperedges represented as of a hypergraph to output a corresponding permuted sequence of hyperedges i.e., . A permutation action can also act on any vector, matrix, or tensor defined over the hyperedges , e.g., , and output an equivalent vector, matrix, or tensor with the order of the hyperedges permuted e.g., . It is crucial to note that a vertex permutation action can be simultaneously performed along with the hyperedge permutation. We represent a joint permutation on the entire edge set as , and for a hyperedge as where
(Node Equivalence Class/ Node Isomorphism) The equivalence classes of vertices of a hypergraph under the action of automorphisms between the vertices are called vertex equivalence classes or vertex orbits. If two vertices are in the same vertex orbit, we say they are node isomorphic and are denoted by .
(Hyperedge Orbit/ Isomorphism) The equivalence classes of non-empty subsets of vertices of a hypergraph under the action of automorphisms between the subsets are called hyperedge orbits. If two proper subsets are in the same hyperedge orbit, we say they are hyperedge isomorphic and are denoted by .
(Hypergraph Orbit and Isomorphism) The hypergraph orbit of a hypergraph is given by the application of the elements of the finite symmetric group on the vertex set, on the edge set, or on any combination of the two, with the associated matrices of the hypergraph modified appropriately. Two hypergraphs and are said to be isomorphic (equivalent), denoted by , iff there exists either a vertex permutation action or a hyperedge permutation action or both such that . The hypergraph orbits are then the equivalence classes under this relation; two hypergraphs and are equivalent iff their hypergraph orbits are the same.
(invariant functions) A function acting on a hypergraph given by is invariant whenever it is invariant to any vertex permutation/ edge permutation action in the symmetric space i.e., and all isomorphic hypergraphs obtain the same representation. Similarly, a function acting on a hyperedge for a given hypergraph , is said to be invariant iff all isomorphic hyperedges obtain the same representation.
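Invariance under separate vertex and hyperedge permutations can be illustrated on a small incidence matrix. The sketch below is an assumption-laden toy: `degree_signature` is a deliberately simple invariant chosen for this example (it is not the representation learned in the paper and does not separate all non-isomorphic hypergraphs), and `first_row` is a counterexample that depends on vertex order.

```python
from itertools import permutations

def permute_rows(matrix, perm):
    """Vertex permutation: row i moves to row perm[i]."""
    out = [None] * len(matrix)
    for i, row in enumerate(matrix):
        out[perm[i]] = row
    return out

def permute_cols(matrix, perm):
    """Hyperedge permutation: column j moves to column perm[j]."""
    out = [[None] * len(row) for row in matrix]
    for r, row in enumerate(matrix):
        for j, value in enumerate(row):
            out[r][perm[j]] = value
    return out

def degree_signature(matrix):
    """Sorted vertex degrees and sorted hyperedge sizes: unchanged by relabeling."""
    vertex_degrees = sorted(sum(row) for row in matrix)
    edge_sizes = sorted(sum(col) for col in zip(*matrix))
    return vertex_degrees, edge_sizes

def first_row(matrix):
    """Depends on which vertex is listed first, so it is not invariant."""
    return list(matrix[0])

# Incidence matrix of a toy hypergraph with hyperedges {0, 1, 2} and {2, 3}.
H = [[1, 0], [1, 0], [1, 1], [0, 1]]
invariant = all(
    degree_signature(permute_cols(permute_rows(H, p), q)) == degree_signature(H)
    for p in permutations(range(4)) for q in permutations(range(2))
)
not_invariant = any(first_row(permute_rows(H, p)) != first_row(H)
                    for p in permutations(range(4)))
```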
2.1 Problem Setup
Partially observed hyperedge expansion. Consider a hypergraph where a small fraction of hyperedges in the hyperedge set are partially observed and let be the completely observed hyperedge set. A partially observed hyperedge implies , but , where is the corresponding completely observed hyperedge of . The task here is, given a partial hyperedge but , to complete with vertices from so that after hyperedge expansion .
Unobserved hyperedge classification. Consider a hypergraph with an incompletely observed hyperedge set and let be the corresponding completely observed hyperedge set with . An incomplete hyperedge set implies where . It is important to note that in this case, if a certain hyperedge is present in , then the hyperedge is not missing any vertices in the observed hyperedges. The task here is, for a given hypergraph , to predict whether a new hyperedge was present but unobserved in the noisy hyperedge set i.e., but .
3 Learning Framework and Theory
Previous hypergraph neural networks [10, 23, 4] employ a proxy graph to learn vertex representations for every vertex , by aggregating information over its neighborhood. Hyperedge (or, alternatively, hypergraph) representations are then obtained, when necessary, by using a pooling operation (e.g., sum, max, or mean) over the vertices in the hyperedge (or over the vertex set of the hypergraph). However, such a strategy fails to (1) preserve properties of equivalence classes of hyperedges/hypergraphs and (2) capture the implicit higher-order interactions between the nodes/hyperedges, and it fails on higher-order tasks, as shown by [19].
To alleviate these issues, in this work, we use a message passing framework on the incidence graph representation of the observed hypergraph, which synchronously updates the node and observed hyperedge embeddings as follows:
(1) 
(2) 
where denotes vector concatenation, are injective set functions (constructed via [25, 18]) in the layer, are the vector representations of the hyperedge and vertices after layers, are learnable weight matrices and is an elementwise activation function. We use (in practice, =2) to denote the total number of convolutional layers used. Equation 2 shows that a vertex receives messages not only from the hyperedges it belongs to, but also from neighboring vertices in the clique expansion. Similarly, Equation 1 shows that a hyperedge receives messages from its constituent vertices as well as from neighboring hyperedges in the line graph of the hypergraph.
However, Equations 1 and 2 on their own do not provide a way to learn representations of unobserved hyperedges for downstream tasks. To do this, after the convolutional layers, the representation of any hyperedge (observed or unobserved) is obtained using a function as:
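The message flow described in this paragraph can be sketched in plain Python. The sketch is an assumption-heavy simplification and not the exact form of Equations 1 and 2: an elementwise sum stands in for the injective set functions, and the learnable weight matrices and the activation are omitted, so the output is just the concatenation of a node's (or hyperedge's) own vector with its two aggregated messages. All function names are hypothetical.

```python
def vec_sum(vectors, dim):
    """Elementwise sum of an iterable of equal-length vectors (zero vector if empty)."""
    total = [0.0] * dim
    for vec in vectors:
        total = [t + x for t, x in zip(total, vec)]
    return total

def incidence_layer(hyperedges, h_v, h_e):
    """One synchronous update.

    hyperedges: list of sets of vertex ids
    h_v: dict vertex -> list[float]; h_e: list of list[float], one per hyperedge
    """
    dim_v = len(next(iter(h_v.values())))
    dim_e = len(h_e[0])
    new_h_e = []
    for i, e in enumerate(hyperedges):
        # Messages from constituent vertices and from line-graph neighbors.
        from_vertices = vec_sum((h_v[v] for v in e), dim_v)
        from_line_graph = vec_sum(
            (h_e[j] for j, f in enumerate(hyperedges) if j != i and e & f), dim_e)
        new_h_e.append(h_e[i] + from_vertices + from_line_graph)  # list concatenation
    new_h_v = {}
    for v in h_v:
        # Messages from incident hyperedges and from clique-expansion neighbors.
        star = [i for i, e in enumerate(hyperedges) if v in e]
        clique_nbrs = set().union(*(hyperedges[i] for i in star)) - {v}
        from_edges = vec_sum((h_e[i] for i in star), dim_e)
        from_clique = vec_sum((h_v[u] for u in clique_nbrs), dim_v)
        new_h_v[v] = h_v[v] + from_edges + from_clique
    return new_h_v, new_h_e

hyperedges = [{0, 1, 2}, {2, 3}]
h_v = {v: [1.0] for v in range(4)}
h_e = [[1.0], [1.0]]
new_h_v, new_h_e = incidence_layer(hyperedges, h_v, h_e)
```

Both updates read only the previous-layer vectors, which is what makes the update synchronous. In this toy example vertices 0 and 1 play symmetric roles and end up with identical vectors, while vertex 2, which lies in both hyperedges, does not.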
(3) 
where are injective set, multiset functions respectively, and denotes the model parameters of the entire hypergraph neural network (convolutional layers, set functions). Correspondingly, the representation of the complete hypergraph is obtained using a function as:
(4) 
3.1 Theory
In what follows, we list the properties of the vertex/hyperedge representations. All proofs are presented in the Supplementary Material.
(Property: Vertex Representations) The representation of a vertex in a hypergraph learnt using Equation 2 is a invariant representation where such that = where . Moreover, two vertices which belong to the same vertex equivalence class i.e. obtain the same representation.
(Property: Hyperedge Representations) The representation of a hyperedge in a hypergraph learnt using Equation 1 is a invariant representation where such that = where . Moreover, two hyperedges which belong to the same hyperedge equivalence class i.e. obtain the same representation.
Next, we restate a theorem from [20] which provides a means to deterministically distinguish non-isomorphic hypergraphs. Subsequently, we characterize the expressivity of our model to distinguish non-isomorphic hypergraphs.
(Theorem [20]) Let be hypergraphs without isolated vertices whose line hypergraphs are isomorphic. Then if and only if there exists a bijection such that where is the family of stars of the hypergraph .
(Theorem) Let , be two non-isomorphic hypergraphs with finite vertex and hyperedge sets and no isolated vertices. If the Weisfeiler-Lehman test of isomorphism decides their line graphs or the star expansions of their duals to be not isomorphic, then there exists a function (via Equation 4) and parameters that maps the hypergraphs to different representations.
We now extend this to the expressivity of the hyperedge representations and then show that the property of separate exchangeability [2] of the incidence matrix is preserved by the hypergraph representation.
(Corollary) There exists a function (via Equation 3) and parameters that maps two non-isomorphic hyperedges to different representations.
(Remark: Separate Exchangeability) The representation of a hypergraph learnt using the function (via Equation 4) preserves the separate exchangeability of the incidence structure of the hypergraph.
We now describe the learning procedures for the two tasks, namely variable size hyperedge classification and variable size hyperedge expansion.
3.2 Hyperedge Classification
For a hypergraph , let denote the partially observed hyperedge set in our data corresponding to the true hyperedge set . The goal here is to learn a classifier over the representations of hyperedges (obtained using Equation 3) s.t is used to classify if an unobserved hyperedge exists i.e. but where all for , and is the logistic sigmoid.
Now, for the given hypergraph , let be the target random variables associated with the vertex power set of the graph. Let be the corresponding true values attached to the vertex subsets in the power set, such that iff . We then model the joint distribution of the hyperedges in the hypergraphs by making a mean field assumption as:
(5) 
Subsequently, to learn the model parameters , we make a closed-world assumption, treat only the observed hyperedges in as positive and all other hyperedges as negative, and seek to maximize the log-likelihood.
(6) 
Since the size of the vertex power set grows exponentially with the number of vertices, it is computationally intractable to use all negative hyperedges in the training procedure. Our training procedure hence employs negative sampling (in practice, we use 5 distinct negative samples for every hyperedge in every epoch) combined with a cross-entropy loss function to learn the model parameters via backpropagation. This framework can trivially be extended to perform multi-class classification on variable-sized hyperedges.
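A minimal sketch of the negative sampling step and the loss is given below. The text does not specify how negatives are drawn, so the sampler here makes an assumption: each negative is a uniformly random vertex subset of the same size as its positive hyperedge that does not appear in the observed hyperedge set. The function names are hypothetical, and the loop assumes enough unobserved subsets exist to find five distinct ones.

```python
import math
import random

def sample_negatives(hyperedges, num_vertices, per_positive=5, seed=0):
    """Draw `per_positive` distinct unobserved vertex subsets for every observed hyperedge."""
    rng = random.Random(seed)
    observed = {frozenset(e) for e in hyperedges}
    negatives = []
    for e in hyperedges:
        drawn = set()
        while len(drawn) < per_positive:
            # Assumption: negatives match the size of the positive hyperedge.
            candidate = frozenset(rng.sample(range(num_vertices), len(e)))
            if candidate not in observed:
                drawn.add(candidate)
        negatives.extend(drawn)
    return negatives

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(probs, labels))
    return -total / len(probs)

hyperedges = [{0, 1, 2}, {3, 4}]
negatives = sample_negatives(hyperedges, num_vertices=10)
```

With two observed hyperedges this yields ten negatives per epoch, so the classifier sees one positive for every five negatives.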
3.3 Hyperedge Completion
The set expansion task introduced in [25] makes the infinite de Finetti assumption, i.e., the elements of the set are i.i.d. When learning over finite graphs and hypergraphs, this assumption is no longer valid since the data is relational, i.e., a finite de Finetti [9] assumption is required. Additionally, the partial exchangeability of the structures (adjacency matrix/incidence matrix) [2] has to be factored in as well.
This raises multiple concerns: (1) computing mutual information of a partial vertex set with all other disjoint vertex subsets in the power set is computationally intractable; (2) to learn a model in the finite de Finetti setting, we need to consider all possible permutations for a vertex subset. For example, under the conditions of finite exchangeability, the pointwise mutual information between two random variables  where both are disjoint elements of the vertex power set (or hyperedges) i.e. is given by:
(7) 
where is a prior and each of cannot be factorized any further i.e.
(8) 
where and denotes the set of all possible permutations of the elements of . The inability to factorize Equation 8 any further leaves no room for computational savings through a strategic addition of vertices one at a time (i.e., no computations can be reused).
As a solution to this computationally expensive problem, we propose a GAN framework [11] to learn a probability distribution over the vertex power set, conditioned on a partially observed hyperedge, without sacrificing on the underlying objective of maximizing pointwise mutual information between (Equation 7). We describe the working of the generator and the discriminator of the GAN, with the help of a single example below.
Let denote a partially observed hyperedge and denote the representation of the partially observed hyperedge obtained via Equation 3. Let denote the true and predicted vertices respectively to complete the partial hyperedge , where .
Generator()
The goal of the generator is to accurately predict as . We solve this using a two-fold strategy: first predict the number of elements , missing in the partially observed hyperedge and then jointly select vertices from . Ideally, the selection of the best vertices should be performed over all vertex subsets of size (where vertices are sampled from without replacement). However, this is computationally intractable even for small values e.g. for large graphs with millions of nodes.
We predict the number of elements missing in a hyperedge, , using a function over the representation of the partial hyperedge, . To address the problem of jointly selecting a set of vertices without sacrificing computational tractability, we employ a variant of the top-K problem often used in the computing literature.
The standard top-K operation can be adapted to vertices as: given a set of vertices of a graph , to return a vector such that
However, a standard top-K procedure, which operates by sampling vertices (from the vertex set, a categorical distribution), is discrete and hence not differentiable. To alleviate the issue of differentiability, Gumbel-softmax [13, 16] could be employed to provide a differentiable approximation to sampling discrete data. However, explicit top-K Gumbel sampling (computing the likelihood for all possible sets of size over the complete domain) is computationally prohibitive and hence finds limited application in hypergraphs with a large number of nodes and hyperedges.
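For illustration, a hard top-K selection and its randomized counterpart can be written as below. This is a sketch under assumptions: `top_k_indicator` returns a 0/1 vector over vertices, and `gumbel_top_k` perturbs log-weights with Gumbel noise before taking the top K, which draws K distinct indices. Both are discrete and non-differentiable; neither is the relaxed Gumbel-softmax estimator of [13, 16], and the names are hypothetical.

```python
import math
import random

def top_k_indicator(scores, k):
    """Return a 0/1 vector with ones at the k highest-scoring positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(order[:k])
    return [1 if i in chosen else 0 for i in range(len(scores))]

def gumbel_top_k(log_weights, k, rng):
    """Sample k distinct positions by adding Gumbel(0, 1) noise to the log-weights."""
    noisy = []
    for lw in log_weights:
        u = max(rng.random(), 1e-12)  # guard against log(0)
        noisy.append(lw - math.log(-math.log(u)))
    return top_k_indicator(noisy, k)

scores = [0.1, 0.9, 0.5, 0.7]
hard = top_k_indicator(scores, 2)
sample = gumbel_top_k([0.0] * 6, 3, random.Random(0))
```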
In this work, we sacrifice differentiability and focus on scalability. We limit the vertex pool (which can complete the hyperedge) to only vertices in the two-hop neighborhood (in the clique expansion ) of the vertices in the partial hyperedge. For real-world datasets, even the reduced vertex pool consists of a considerable number of vertices, and explicitly computing all sets of size is still prohibitive. In such cases, we sample uniformly at random a large number of distinct vertex subsets of size from the reduced vertex pool, where is the size predicted by the generator. In practice, the large number is typically min(, 100,000), where is the number of vertices in the reduced vertex pool. Subsequently, we compute the inner product of the representations of these subsets (computed using Equation 3) with the representation of the partially observed hyperedge. We then use a simple top-1 to select the set of size which maximizes the inner product.
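The selection procedure of this paragraph can be sketched as follows. Several pieces are assumptions made for the example: the candidate budget is read as the smaller of the number of size-k subsets of the pool and 100,000, `embed` is a hypothetical stand-in for the learned set representation (here a sum of fixed toy vectors, which unlike the learned function does not capture interactions among the selected vertices), and all names are invented.

```python
import random
from itertools import combinations
from math import comb

def two_hop_pool(hyperedges, partial):
    """Vertices within two hops of the partial hyperedge in the clique expansion."""
    one_hop = set().union(*(e for e in hyperedges if e & partial))
    two_hop = set().union(*(e for e in hyperedges if e & one_hop))
    return (one_hop | two_hop) - partial

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def expand(partial, pool, k, embed, max_candidates=100_000, seed=0):
    """Pick the size-k subset of the pool whose embedding best matches the partial hyperedge."""
    rng = random.Random(seed)
    pool = sorted(pool)
    if comb(len(pool), k) <= max_candidates:
        # Few enough subsets: enumerate all of them.
        candidates = [frozenset(c) for c in combinations(pool, k)]
    else:
        # Otherwise sample distinct subsets uniformly at random.
        sampled = set()
        while len(sampled) < max_candidates:
            sampled.add(frozenset(rng.sample(pool, k)))
        candidates = list(sampled)
    target = embed(partial)
    # Top-1 over whole candidate sets, scored by inner product.
    return max(candidates, key=lambda c: dot(embed(c), target))

# Toy stand-in for the learned representation: sum of fixed per-vertex vectors.
features = {0: [10, 0], 1: [10, 0], 2: [9, 1], 3: [0, 10], 4: [8, 0]}

def toy_embed(vertices):
    return [sum(features[v][d] for v in vertices) for d in range(2)]

hyperedges = [{0, 1}, {1, 2, 3}, {3, 4}]
partial = {0, 1}
pool = two_hop_pool(hyperedges, partial)
completion = expand(partial, pool, k=2, embed=toy_embed)
```

In this toy run the pool is {2, 3, 4} and the selected completion is {2, 4}. The sketch expects `k` to be no larger than the pool size; otherwise there are no candidates to score.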
Discriminator()
The goal of the discriminator is to distinguish the true, but unobserved hyperedges from the others. To do this, we obtain representations of (and similarly for the predicted using the generator ) and employ the discriminator in the same vein as Equation 7. As a surrogate for the log-probabilities, we learn a function over the representations of (log-probabilities in a higher-dimensional space). Following this, we apply a function , as a surrogate for the mutual information computation. The equation of the discriminator can then be written as:
(9) 
and correspondingly for , where is the logistic sigmoid.
Our training procedure for the GAN, over the hypergraph , can then be summarized as follows. Let denote the value function and let denote a set of partial hyperedges and denote the corresponding set with all hyperedges completed. Let denote the corresponding true and predicted vertices to complete the hyperedge. The value function can then be written as:
(10) 
In practice, the model parameters of the GAN are learnt using a cross entropy loss and backpropagation. An MSE loss is employed to train the function , the function that predicts the number of missing vertices in a hyperedge, using ground truth information about the number of missing vertices in the partial hyperedge.
4 Results
We first briefly describe the datasets and then present our experimental results on the two hypergraph tasks.
4.1 Datasets
We use the publicly available hypergraph datasets from [6] to evaluate the proposed models against multiple baselines (described below). We ignore the timestamps in the datasets and only use unique hyperedges which contain more than one vertex. Moreover, none of the datasets have node or hyperedge features. We summarize the dataset statistics in the Supplementary material. We briefly describe the hypergraphs and the hyperedges in the different datasets below.

Online tagging data (tags-math-sx; tags-ask-ubuntu): Nodes are tags (annotations) and a hyperedge is a set of tags for a question on online Stack Exchange forums.

Online thread participation data (threads-math-sx; threads-ask-ubuntu): Nodes are users and a hyperedge is a set of users answering a question on a forum.

Two drug networks from the National Drug Code Directory, namely (1) NDC-classes: Nodes are class labels and a hyperedge is the set of class labels applied to a drug (all applied at one time) and (2) NDC-substances: Nodes are substances and a hyperedge is the set of substances in a drug.

U.S. Congress data (congress-bills): Nodes are members of Congress and a hyperedge is the set of members in a committee or cosponsoring a bill.

Email networks (email-Enron; email-Eu): Nodes are email addresses and a hyperedge is a set consisting of all recipient addresses on an email along with the sender’s address.

Contact networks (contact-high-school; contact-primary-school): Nodes are people and a hyperedge is a set of people in close proximity to each other.

Drug use in the Drug Abuse Warning Network (DAWN): Nodes are drugs and a hyperedge is the set of drugs reportedly used by a patient before an emergency department visit.
Table 1: F1 scores for hyperedge classification.
Dataset | Trivial | | | | Ours
NDC-classes | 0.286 | 0.614(0.005) | 0.657(0.020) | 0.676(0.049) | 0.768(0.004)
NDC-substances | 0.286 | 0.421(0.014) | 0.479(0.007) | 0.525(0.006) | 0.512(0.032)
DAWN | 0.286 | 0.624(0.010) | 0.664(0.006) | 0.634(0.003) | 0.677(0.004)
contact-primary-school | 0.286 | 0.645(0.031) | 0.681(0.014) | 0.669(0.012) | 0.716(0.034)
contact-high-school | 0.286 | 0.759(0.030) | 0.724(0.009) | 0.739(0.012) | 0.786(0.033)
tags-math-sx | 0.286 | 0.599(0.009) | 0.635(0.003) | 0.572(0.003) | 0.642(0.006)
tags-ask-ubuntu | 0.286 | 0.545(0.005) | 0.597(0.007) | 0.545(0.006) | 0.605(0.002)
threads-math-sx | 0.286 | 0.453(0.017) | 0.553(0.012) | 0.487(0.006) | 0.586(0.002)
threads-ask-ubuntu | 0.286 | 0.425(0.007) | 0.512(0.007) | 0.464(0.010) | 0.488(0.012)
email-Enron | 0.286 | 0.618(0.032) | 0.594(0.046) | 0.599(0.040) | 0.685(0.016)
email-Eu | 0.286 | 0.664(0.003) | 0.651(0.019) | 0.661(0.006) | 0.687(0.002)
congress-bills | 0.286 | 0.412(0.003) | 0.530(0.055) | 0.544(0.004) | 0.566(0.011)

A 5-fold cross-validation procedure is used; numbers outside the parentheses are the mean values and the standard deviation is given within the parentheses.

Bold values show the maximum empirical average; multiple values are bold when the standard deviation overlaps with another average.
Table 2: Normalized set difference for hyperedge expansion.
Dataset | Simple | Recursive | Ours
NDC-classes | 1.207(0.073) | 1.163(0.015) | 1.107(0.007)
NDC-substances | 1.167(0.000) | 1.161(0.009) | 1.153(0.004)
DAWN | 1.213(0.006) | 1.197(0.022) | 1.088(0.018)
contact-primary-school | 0.983(0.006) | 0.986(0.001) | 0.970(0.005)
contact-high-school | 0.990(0.014) | 1.000(0.000) | 0.989(0.001)
tags-math-sx | 1.012(0.025) | 1.003(0.014) | 0.982(0.011)
tags-ask-ubuntu | 1.008(0.003) | 1.005(0.003) | 0.972(0.001)
threads-ask-ubuntu | 0.999(0.000) | 0.999(0.000) | 0.981(0.003)
email-Enron | 1.152(0.045) | 1.182(0.015) | 1.117(0.049)
email-Eu | 1.199(0.002) | 1.224(0.010) | 1.116(0.013)
congress-bills | 1.186(0.004) | 1.189(0.001) | 1.107(0.004)

A 5-fold cross-validation procedure is used; numbers outside the parentheses are the mean values and the standard deviation is given within the parentheses.

Bold values show the minimum empirical average; multiple values are bold when the standard deviation overlaps with another average.
4.2 Experimental Results
Hyperedge Classification. In this task, we compare our model against four baselines. The first is a trivial predictor, which always predicts 1 for any hyperedge (in practice, we use 5 negative samples for every real hyperedge). The next two baselines apply a GCN [14] or GraphSAGE [12] to the clique expansion of the hypergraph; GCN on the clique expansion of the hypergraph is the model proposed by [10] as HGNN. For the fourth baseline, we use the star expansion of the hypergraph and employ a heterogeneous RGCN to learn the vertex and hyperedge embeddings. In each of the baselines, unobserved hyperedge embeddings are obtained by aggregating the representations of the vertices it contains, using a learnable set function [25, 18]. We report F1 scores on all twelve datasets in Table 1. More details about the experimental setup are presented in the Supplementary material.
Hyperedge Expansion. Due to the lack of prior work on hyperedge expansion, we compare our strategy against two baselines (with an identical GAN framework and setup to predict the number of missing vertices, albeit without computing joint representations of predicted vertices): (1) addition of the top-K vertices, considered independently of each other, and (2) recursive addition of the top-1 vertex. Since all three models are able to predict the number of missing elements accurately (close to 100% accuracy), we introduce the normalized set difference as a statistic to compare the models. The normalized set difference (which is permutation invariant) is the number of insertions/deletions/modifications required to go from the predicted completion to the target completion, divided by the number of missing elements in the target completion. For example, let {7,8,9} be a set which we wish to expand. Then the normalized set difference between a predicted completion {3,5,1,4} and a target completion {1,2} is (1+2)/2 = 1.5 (there is 1 modification and 2 deletions). A lower normalized set difference is better, and a score of 0 indicates a perfect set prediction. Results are presented in Table 2.
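The constant 0.286 reported for the trivial predictor in Table 1 is consistent with this setup. The short calculation below assumes that the evaluation set contains five negatives for every real hyperedge, the same ratio stated for sampling; the function name is hypothetical.

```python
def f1_always_positive(num_positive, num_negative):
    """F1 score of a predictor that labels every candidate hyperedge as positive."""
    true_pos, false_pos, false_neg = num_positive, num_negative, 0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# One real hyperedge for every five sampled negatives.
trivial_f1 = f1_always_positive(num_positive=1, num_negative=5)
```

With precision 1/6 and recall 1, the F1 score is 2/7, which rounds to 0.286.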
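One way to compute the normalized set difference is sketched below. It assumes the counting convention suggested by the worked example: a modification replaces one wrongly predicted vertex with one missing target vertex, and whatever is left over is counted as deletions or insertions. The function name is hypothetical.

```python
def normalized_set_difference(predicted, target):
    """Edit operations needed to turn `predicted` into `target`, divided by |target|."""
    predicted, target = set(predicted), set(target)
    extra = len(predicted - target)    # predicted vertices that are not in the target
    missing = len(target - predicted)  # target vertices that were not predicted
    modifications = min(extra, missing)
    deletions = extra - modifications
    insertions = missing - modifications
    return (modifications + deletions + insertions) / len(target)

score = normalized_set_difference({3, 5, 1, 4}, {1, 2})
perfect = normalized_set_difference({1, 2}, {1, 2})
```

For the example in the text, `score` is (1 + 2) / 2 = 1.5, and a perfect prediction gives 0. Because the inputs are treated as sets, the statistic does not depend on the order in which vertices are predicted.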
5 Discussion
In the hyperedge classification task, Table 1 shows that our model, which has provable expressive properties, performs better than the baselines on most datasets. All three nontrivial baselines appear to suffer from their inability to capture higher order interactions between the vertices in a hyperedge. Moreover, the loss of information from using a proxy graph, in the form of the clique expansion, also affects the performance of the SAGE and GCN baselines. That the SAGE baseline obtains better F1 scores than GCN suggests that the self loops introduced between vertices in the clique expansion hurt performance. The lower scores of the star expansion models can be attributed to their inability to capture vertex-vertex and hyperedge-hyperedge interactions.
For the hyperedge expansion task, Table 2 shows that adding vertices in a way that captures the interactions amongst them performs better than adding vertices independently of each other or recursively. The relatively weaker performance of adding vertices recursively, one at a time, can be attributed to a poor choice of the first vertex to be added (once an unlikely vertex is added, the sequence cannot be corrected).
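To make the difference between the two baseline selection strategies concrete, the following sketch contrasts independent Top-K selection with recursive Top-1 selection. It is an illustration under stated assumptions only: the `score` function and the toy `AFFINITY` table are hypothetical stand-ins for a learned scoring model and are not taken from our implementation.

```python
# Hypothetical pairwise affinities between vertices (toy values for illustration).
AFFINITY = {
    frozenset((7, 3)): 4,
    frozenset((7, 1)): 3,
    frozenset((7, 2)): 2,
    frozenset((3, 2)): 5,
}


def score(current_set, candidate):
    """Hypothetical score: total affinity of `candidate` to the current set."""
    return sum(AFFINITY.get(frozenset((u, candidate)), 0) for u in current_set)


def expand_top_k(partial, candidates, k):
    """Baseline (1): score each candidate independently, keep the k best."""
    ranked = sorted(sorted(candidates), key=lambda v: score(partial, v), reverse=True)
    return set(ranked[:k])


def expand_recursive_top1(partial, candidates, k):
    """Baseline (2): add the single best vertex, then rescore against the grown set."""
    current, added, remaining = set(partial), set(), set(candidates)
    for _ in range(k):
        best = max(sorted(remaining), key=lambda v: score(current, v))
        current.add(best)
        added.add(best)
        remaining.remove(best)  # an earlier choice is never revisited
    return added


print(expand_top_k({7}, {1, 2, 3}, k=2))           # {1, 3}
print(expand_recursive_top1({7}, {1, 2, 3}, k=2))  # {2, 3}
```

In the recursive variant every later choice is conditioned on the first vertex added, so a poor first choice propagates through the rest of the sequence, whereas the independent variant ignores interactions among the added vertices altogether.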
6 Conclusions
In this work, we developed a hypergraph neural network to learn provably expressive representations of vertices, hyperedges and the complete hypergraph. We proposed frameworks for hyperedge classification and a novel hyperedge expansion task, evaluated performance on multiple real-world hypergraph datasets, and demonstrated consistent, significant improvement in accuracy over state-of-the-art models.
7 Appendix
7.1 Examples:
Let denote the complete set of substances which are possible components in a prescription drug. Now, given a partial set of substances part of a single drug, the hyperedge expansion entails completing the set as with a set of substances from , (which were unobserved due to the data collection procedure for instance), with the set chosen s.t. . On the other hand, an example of a hyperedge classification task involves determining whether a certain set of substances can form a valid drug, or alternatively classifying the nature of a prescription drug. From the above examples, it is clear that hyperedge expansion and hyperedge classification require the framework to jointly capture dependencies between all the elements of an input set (for instance, the associated outputs in these two tasks require us to capture all interactions between a set of substances, rather than just the pairwise interactions between a single substance and its neighbors computed independently, as in node classification) and hence are classed as higher order tasks. Additionally, for the hyperedge expansion task, the associated output is a finite set, and hence, in addition to maximizing the interactions between the constituent elements, it is also required to be permutation invariant. For instance, in the expansion task, the training data as well as the associated target variable to be predicted are both sets. The tasks are further compounded by the fact that the training data and the outputs are both relational, i.e. the representation of a vertex/hyperedge also depends on other sets (the composition of other observed drugs) in the family of sets; that is, the data is non-i.i.d.
7.2 Proofs of Properties, Remarks and Theorems
We restate the properties, remark and theorems for convenience and prove them.
Property (Vertex Representations). The representation of a vertex in a hypergraph learnt using Equation 2 is a invariant representation where such that = where . Moreover, two vertices which belong to the same vertex equivalence class i.e. obtain the same representation.
Proof. Part 1: Proof by contradiction. Let be two different vertex permutation actions and let . This implies that the same node gets different representations based on an ordering of the vertex set. From eq. 2 it is clear that the set function ensures that the vertex representation is not impacted by the edge permutation action. Now let Expanding eq. 2 for both vertex permutation actions and applying the cancellation law of groups, is independent of the permutation action. Since is identical for both, it means the difference arises from the edge permutation action, which is not possible. Now, using induction, we can show that if the contradiction holds for a certain , then it holds for as well. Hence,
Part 2: Proof by contradiction. Let be two isomorphic vertices and let This implies However, by definition, the two vertices are isomorphic, i.e. they have the same initial node features (if available) i.e. and they also possess an isomorphic neighborhood. Equation 2 is deterministic, hence the representations obtained by the vertices are also identical after 1 iteration i.e. . Now, using induction, we can show that the representations for is the same as for any Hence when
Property (Hyperedge Representations). The representation of a hyperedge in a hypergraph learnt using Equation 1 is a invariant representation where such that = where Moreover, two hyperedges which belong to the same hyperedge equivalence class i.e. obtain the same representation.
Proof. The proof is similar to the two-part invariant vertex representation proof given above. Replace the vertex permutation action with a joint vertex-edge permutation action and similarly use the cancellation law of groups twice.
Theorem ([20]). Let be hypergraphs without isolated vertices whose line hypergraphs are isomorphic. Then if and only if there exists a bijection such that where is the family of stars of the hypergraph
This theorem is a direct restatement of the theorem in the original work. Please refer to [20] for the proof.
Theorem. Let , be two non-isomorphic hypergraphs with finite vertex and hyperedge sets and no isolated vertices. If the Weisfeiler-Lehman test of isomorphism decides their line graphs or the star expansions of their duals to be not isomorphic then there exists a function (via Equation 4) and parameters that maps the hypergraphs to different representations.
Proof. Part 1: Proof by construction, for the line graph . Consider Equation 1. By construction, make the set function as an injective function with a multiplier of a negligible value i.e. . This implies that a hyperedge only receives information from its adjacent hyperedges. Since we use injective set functions, following the proof of [22] Lemma 2 and Theorem 3, by induction it is easy to see that if the 1-WL isomorphism test decides that the line graphs are non-isomorphic, the representations obtained by the hyperedges through the iterative message passing procedure are also different.
Part 2: Proof by construction, for the dual graph. Again, consider Equation 1. By construction, associate a unique identifier with every node and hyperedge in the hypergraph. Construct as an identity map; this implies that a hyperedge also preserves the information of which vertices it receives information from. Since the above is injective, again following the proof of [22] Lemma 2 and Theorem 3, by induction it is easy to see that if the 1-WL isomorphism test decides that the duals of the hypergraphs are non-isomorphic, the representations obtained by the hyperedges through the iterative message passing procedure are also different.
Part 3: From part 1 and part 2 of the proof above, we see that if either the line graphs or the duals of the hypergraphs are distinguishable by the 1-WL isomorphism test as non-isomorphic, then our proposed model is able to detect it as well. From the property of vertex representations it is also seen that isomorphic vertices obtain the same representation, hence preserving the family of stars representation as well. Now consider Equation 4. If the line hypergraphs and are distinguishable via the line graphs or the dual graphs, then the representations obtained by hyperedge aggregation are different. Correspondingly, if the family of stars does not preserve a bijection across the two hypergraphs, then the representations of the graphs are distinguishable using the vertex aggregation.
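For readers less familiar with the 1-WL test invoked in Part 1, the sketch below builds the line graph of a hypergraph (two hyperedges are adjacent when they share a vertex) and runs classical 1-WL colour refinement on it. This is the standard test, not our model; the function names and the two toy hypergraphs are assumptions made for illustration, and all hyperedges start with the same colour, so hyperedge cardinalities are deliberately ignored.

```python
from collections import Counter
from itertools import combinations


def line_graph(hyperedges):
    """Adjacency of the line graph: hyperedges i and j are adjacent if they intersect."""
    sets = [frozenset(e) for e in hyperedges]
    adjacency = {i: set() for i in range(len(sets))}
    for i, j in combinations(range(len(sets)), 2):
        if sets[i] & sets[j]:
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency


def wl_colors(adjacency, rounds=3):
    """Multiset of 1-WL colours after `rounds` refinement steps."""
    colors = {node: 0 for node in adjacency}  # uniform initial colouring
    for _ in range(rounds):
        # New colour = (own colour, sorted multiset of neighbour colours).
        colors = {
            node: (colors[node], tuple(sorted(colors[nbr] for nbr in adjacency[node])))
            for node in adjacency
        }
    return Counter(colors.values())


def wl_distinguishes(adj_a, adj_b, rounds=3):
    """True if 1-WL decides that the two graphs are not isomorphic."""
    return wl_colors(adj_a, rounds) != wl_colors(adj_b, rounds)


chain = [{1, 2, 3}, {3, 4}, {4, 5}]   # line graph is a path on three nodes
bundle = [{1, 2, 3}, {3, 4}, {3, 5}]  # line graph is a triangle
relabeled_chain = [{"a", "b", "c"}, {"c", "d"}, {"d", "e"}]

print(wl_distinguishes(line_graph(chain), line_graph(bundle)))           # True
print(wl_distinguishes(line_graph(chain), line_graph(relabeled_chain)))  # False
```

The nested-tuple colours are kept unhashed and unrelabeled so that colour multisets from two different graphs can be compared directly; this is adequate for small examples but grows quickly with the number of rounds.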
There exists a function (via Equation 3) and parameters that map two non-isomorphic hyperedges to different representations.
Proof. The proof is a direct consequence of the above theorem, eq. 3 and the above property of hyperedges.
Remark (Separate Exchangeability). The representation of a hypergraph learnt using the function (via Equation 4) preserves the separate exchangeability of the incidence structure of the hypergraph.
Proof. From Equation 4, it is clear that once the representations of the observed vertices and hyperedges are obtained, the vertex permutation actions do not affect the edge permutation and vice versa, i.e. the set functions act independently of each other. From Equation 4 and through the use of set functions, it is also clear that the representation of the hypergraph is invariant to permutations of both vertices and edges.
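The invariance stated in the remark can be checked numerically on a toy example. The sketch below is not Equation 4; it is a hypothetical readout, assumed only for illustration, that uses nothing but sums over vertices and sums over hyperedges of an incidence matrix, and verifies that permuting rows (vertices) and columns (hyperedges) independently leaves the readout unchanged.

```python
import random


def readout(incidence):
    """Toy hypergraph readout built only from sum aggregations (illustrative assumption)."""
    n, m = len(incidence), len(incidence[0])
    # Hyperedge state: sum over member vertices (here, the hyperedge cardinality).
    edge_state = [sum(incidence[v][e] for v in range(n)) for e in range(m)]
    # Vertex state: sum over incident hyperedges of a nonlinear function of their state.
    vertex_state = [
        sum(incidence[v][e] * edge_state[e] ** 2 for e in range(m)) for v in range(n)
    ]
    # Two separate sum-pooled set functions, one over vertices and one over hyperedges.
    return (sum(s ** 3 for s in vertex_state), sum(s ** 3 for s in edge_state))


def permute(incidence, row_perm, col_perm):
    """Apply independent permutations to the vertex (row) and hyperedge (column) axes."""
    return [[incidence[r][c] for c in col_perm] for r in row_perm]


# 4 vertices (rows) and 3 hyperedges (columns) of different cardinalities.
H = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
]

rng = random.Random(0)
row_perm, col_perm = list(range(4)), list(range(3))
rng.shuffle(row_perm)
rng.shuffle(col_perm)

print(readout(H) == readout(permute(H, row_perm, col_perm)))  # True
```

Integer arithmetic is used so that the comparison is exact; with floating-point features the same check would need a tolerance.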
7.3 Dataset Statistics
In Table 3 we list the number of vertices and hyperedges for each of the datasets we have considered.
Dataset                   # Vertices   # Hyperedges
NDC-classes                     1161            679
NDC-substances                  5556           4916
contact-primary-school           242           4036
contact-high-school              327           1870
threads-math-sx               201863         177398
threads-ask-ubuntu            200974          18785
email-Enron                      148            577
email-EU                        1005          10631
7.4 Experimental Setup
Our implementation is in PyTorch using Python 3.6. For the hyperedge classification task, we used 5 negative samples for each positive sample. For the hyperedge expansion task, the number of vertices to be added varied from 2 to 7. The graph neural networks are implemented using the Deep Graph Library [21]. We used two convolutional layers for all the baselines as well as our model, since this gave the best performance in our tasks (we tested 2, 3, 4 and 5 convolutional layers). For all the models, the hidden dimension of the convolutional layers and set functions was tuned over {8,16,32,64}. Optimization is performed with the Adam optimizer, and the learning rate was tuned over {0.1, 0.01, 0.001, 0.0001, 0.00001}. For the set functions we chose from [25] and [18]. For more details refer to the code provided.
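The search space described above can be written down as a small configuration sketch. The dictionary keys and the `search_grid` helper are hypothetical names, and enumerating the full Cartesian product is an assumption; the text only states which values were considered for each hyperparameter, not how they were combined.

```python
from itertools import product

# Values stated in the experimental setup.
HIDDEN_DIMS = [8, 16, 32, 64]
LEARNING_RATES = [0.1, 0.01, 0.001, 0.0001, 0.00001]
NUM_CONV_LAYERS = 2          # selected after testing 2, 3, 4 and 5 layers
NEGATIVES_PER_POSITIVE = 5   # hyperedge classification task


def search_grid():
    """Enumerate every (hidden dimension, learning rate) pair; assumes a full grid search."""
    return [
        {
            "hidden_dim": hidden_dim,
            "learning_rate": learning_rate,
            "num_conv_layers": NUM_CONV_LAYERS,
            "optimizer": "adam",
        }
        for hidden_dim, learning_rate in product(HIDDEN_DIMS, LEARNING_RATES)
    ]


print(len(search_grid()))  # 20
```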
References
[1] (2006) Higher order learning with graphs. In Proceedings of the 23rd International Conference on Machine Learning, pp. 17–24.
[2] (1981) Representations for partially exchangeable arrays of random variables. Journal of Multivariate Analysis 11 (4), pp. 581–598.
[3] (1980) Random graph isomorphism. SIAM Journal on Computing 9 (3), pp. 628–635.
[4] (2019) Hypergraph convolution and hypergraph attention. arXiv preprint arXiv:1901.08150.
[5] (2013) Random walks in hypergraph. In Proceedings of the 2013 International Conference on Applied Mathematics and Computational Methods, Venice, Italy, pp. 187–194.
[6] (2018) Simplicial closure and higher-order link prediction. Proceedings of the National Academy of Sciences 115 (48), pp. E11221–E11230.
[7] (1984) Hypergraphs: combinatorics of finite sets. Vol. 45, Elsevier.
[8] (2019) Random walks on hypergraphs with edge-dependent vertex weights. arXiv preprint arXiv:1905.08287.
[9] (1977) Finite forms of de Finetti's theorem on exchangeability. Synthese 36 (2), pp. 271–281.
[10] (2019) Hypergraph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3558–3565.
[11] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
[12] (2017) Inductive representation learning on large graphs. In NIPS.
[13] (2016) Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144.
[14] (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
[15] (2011) High-ordered random walks and generalized Laplacians on hypergraphs. In International Workshop on Algorithms and Models for the Web-Graph, pp. 14–25.
[16] (2016) The concrete distribution: a continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712.
[17] (2019) Weisfeiler and Leman go neural: higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4602–4609.
[18] (2018) Janossy pooling: learning deep permutation-invariant functions for variable-size inputs. arXiv preprint arXiv:1811.01900.
[19] (2020) On the equivalence between positional node embeddings and structural graph representations. In International Conference on Learning Representations.
[20] (1996) Line hypergraphs. Discrete Mathematics 161 (1–3), pp. 265–283.
[21] (2019) Deep graph library: a graph-centric, highly-performant package for graph neural networks. arXiv preprint arXiv:1909.01315.
[22] (2018) How powerful are graph neural networks? arXiv preprint arXiv:1810.00826.
[23] (2019) HyperGCN: a new method for training graph convolutional networks on hypergraphs. In Advances in Neural Information Processing Systems, pp. 1509–1520.
[24] (2020) Hypergraph learning with line expansion. arXiv preprint arXiv:2005.04843.
[25] (2017) Deep sets. In Advances in Neural Information Processing Systems, pp. 3391–3401.
[26] (2019) Hyper-SAGNN: a self-attention based graph neural network for hypergraphs. arXiv preprint arXiv:1911.02613.
[27] (2007) Learning with hypergraphs: clustering, classification, and embedding. In Advances in Neural Information Processing Systems, pp. 1601–1608.