Factor Graph Neural Network
Abstract
Most successful deep neural network architectures are structured, often consisting of elements like convolutional neural networks and gated recurrent neural networks. Recently, graph neural networks have been successfully applied to graph-structured data such as point clouds and molecular data. These networks often only consider pairwise dependencies, as they operate on a graph structure. We generalize the graph neural network into a factor graph neural network (FGNN) in order to capture higher order dependencies. We show that FGNN is able to represent Max-Product Belief Propagation, an approximate inference algorithm on probabilistic graphical models; hence it is able to do well when Max-Product does well. Promising results on both synthetic and real datasets demonstrate the effectiveness of the proposed model.
Zhen Zhang, Department of Computer Science, National University of Singapore, zhangz@comp.nus.edu.sg
Fan Wu, Department of Computer Science, Nanjing University, fan01172000@gmail.com (work was done during a visit to the Department of Computer Science, National University of Singapore)
Wee Sun Lee, Department of Computer Science, National University of Singapore, leews@comp.nus.edu.sg
Preprint. Under review.
1 Introduction
Deep neural networks are powerful approximators that have been extremely successful in practice. While fully connected networks are universal approximators, successful networks in practice tend to be structured, e.g. convolutional neural networks and gated recurrent neural networks such as LSTM and GRU. Convolutional neural networks capture spatial or temporal correlation of neighbouring inputs, while recurrent neural networks capture temporal information, retaining information from earlier parts of a sequence. Graph neural networks (see e.g. (Gilmer et al., 2017; Xu et al., 2018)) have recently been successfully used with graph-structured data to capture pairwise dependencies between variables and to propagate the information to the entire graph.
Real world data often have higher order dependencies, e.g. atoms satisfy valency constraints on the number of bonds that they can make in a molecule. In this paper, we show that the graph neural network can be extended in a natural way to capture higher order dependencies through the use of the factor graph structure. A factor graph is a bipartite graph with a set of variable nodes connected to a set of factor nodes; each factor node in the graph indicates the presence of dependencies among the variables it is connected to. We call the neural network formed from the factor graph a factor graph neural network (FGNN).
Factor graphs have been used extensively for specifying probabilistic graphical models (PGMs), which model dependencies among random variables. Unfortunately, PGMs suffer from scalability issues, as inference in PGMs often requires solving NP-hard problems. Once a PGM has been specified or learned, an approximate inference algorithm, e.g. Sum-Product or Max-Product Belief Propagation, is often used to infer the values of the target variables (see e.g. (Koller and Friedman, 2009)). Unlike PGMs, which usually specify the semantics of the variables being modeled as well as the approximate algorithm used for inference, graph neural networks usually learn a set of latent variables together with the inference procedure from data, normally in an end-to-end manner; the graph structure only provides information on the dependencies along which information propagates. For problems where domain knowledge is weak, or where approximate inference algorithms do poorly, being able to learn an inference algorithm jointly with the latent variables, specifically for the target data distribution, often produces superior results.
We take the approach of jointly learning the algorithm and latent variables in developing the factor graph neural network. The FGNN is defined using two types of modules, the Variable-to-Factor (VF) module and the Factor-to-Variable (FV) module, as shown in Figure 1. These modules are combined into a layer, and the layers can be stacked together into an algorithm. We show that the FGNN is able to exactly parameterize the Max-Product Belief Propagation algorithm, which is widely used for finding an approximate maximum a posteriori (MAP) assignment of a PGM. Thus, in situations where belief propagation gives the best solutions, the FGNN can mimic the belief propagation procedure. In other cases, doing end-to-end learning of the latent variables and the message passing transformations at each layer may result in a better algorithm. Furthermore, we also show that in some special cases, the FGNN can be transformed into a particular pairwise graph neural network structure, allowing simpler implementation.
We evaluate our model on a synthetic problem with constraints on the number of elements that may be present in subsets of variables to study the strengths of the approach. We then apply the method to two real 3D point cloud problems. We achieve state-of-the-art results on one problem while being competitive on the other. The promising results show the effectiveness of the proposed algorithm.
2 Background
2.1 Probabilistic Graphical Model and MAP Inference
Probabilistic graphical models (PGMs) use graph structures to model dependencies between random variables. These dependencies are conveniently represented using a factor graph, a bipartite graph in which each variable vertex $i$ is associated with a random variable $x_i$, each factor vertex $c$ is associated with a function $f_c$, and there is an edge between variable vertex $i$ and factor vertex $c$ if and only if $f_c$ depends on the variable $x_i$.
Let $\mathbf{x} = (x_1, \dots, x_n)$ represent the set of all variables and let $\mathbf{x}_c$ represent the subset of variables that $f_c$ depends on. Denote the set of indices of the variables in $\mathbf{x}_c$ by $s(c)$. We consider probabilistic models for discrete random variables of the form

$$p(\mathbf{x}) = \frac{1}{Z} \prod_i \phi_i(x_i) \prod_c \phi_c(\mathbf{x}_c) \qquad (1)$$

where $\phi_i$, $\phi_c$ are positive functions called potential functions (with $\theta_i = \log \phi_i$, $\theta_c = \log \phi_c$ as the corresponding log-potential functions) and $Z$ is a normalizing constant.
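To make the model class in (1) concrete, the following sketch builds a hypothetical toy model (three binary variables, random unary log-potentials and one tabular factor log-potential, all names and values illustrative) and finds the MAP assignment by brute-force enumeration; for such tiny models, exact enumeration is feasible.

```python
import numpy as np

np.random.seed(0)

# Hypothetical toy model of the form in Eq. (1): three binary variables,
# unary log-potentials theta_i and one factor log-potential theta_c over (x0, x1, x2).
theta_unary = [np.array([0.0, 1.0]),      # theta_0(x0)
               np.array([0.5, 0.0]),      # theta_1(x1)
               np.array([0.0, 0.2])]      # theta_2(x2)
theta_factor = np.random.randn(2, 2, 2)   # theta_c(x0, x1, x2), tabular

def unnormalized_log_prob(x):
    """log p(x) + log Z  =  sum_i theta_i(x_i) + theta_c(x_c)."""
    return sum(t[xi] for t, xi in zip(theta_unary, x)) + theta_factor[tuple(x)]

def map_by_enumeration():
    """Exact MAP assignment by brute force (feasible only for tiny models)."""
    states = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
    return max(states, key=unnormalized_log_prob)
```

Enumeration scales as $K^n$ for $n$ variables with $K$ states each, which is exactly why the approximate methods discussed next are needed.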
The goal of MAP inference (Koller and Friedman, 2009) is to find the assignment which maximizes $p(\mathbf{x})$, that is,

$$\mathbf{x}^{*} = \operatorname*{arg\,max}_{\mathbf{x}} \Big[ \sum_i \theta_i(x_i) + \sum_c \theta_c(\mathbf{x}_c) \Big]. \qquad (2)$$
The combinatorial optimization problem (2) is NP-hard in general, and thus it is often solved using approximate methods. One common method is Max-Product Belief Propagation, an iterative message passing method whose updates can be written in the log domain as

$$m_{c \to i}(x_i) = \max_{\mathbf{x}_c : [\mathbf{x}_c]_i = x_i} \Big[ \theta_c(\mathbf{x}_c) + \sum_{j \in s(c) \setminus \{i\}} m_{j \to c}(x_j) \Big], \qquad m_{i \to c}(x_i) = \theta_i(x_i) + \sum_{c' \ni i,\, c' \neq c} m_{c' \to i}(x_i). \qquad (3)$$
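As a concrete illustration of these updates on a tree-structured model, the following minimal sketch runs log-domain max-product (max-sum) on a chain of three binary variables, where it is exact, and can be checked against brute-force enumeration; the chain length, state count, and random potentials are illustrative choices, not the paper's setup.

```python
import numpy as np

np.random.seed(1)
K = 2                                              # states per variable
theta = [np.random.randn(K) for _ in range(3)]     # unary log-potentials
psi = [np.random.randn(K, K) for _ in range(2)]    # pairwise log-potentials on (x0,x1), (x1,x2)

def max_product_chain(theta, psi):
    """Log-domain max-product on a chain; exact on trees (assuming no ties)."""
    n = len(theta)
    # fwd[i](x_i): best score of everything left of x_i, maximized out
    fwd = [np.zeros(K) for _ in range(n)]
    for i in range(1, n):
        fwd[i] = np.max(theta[i-1][:, None] + fwd[i-1][:, None] + psi[i-1], axis=0)
    # bwd[i](x_i): best score of everything right of x_i, maximized out
    bwd = [np.zeros(K) for _ in range(n)]
    for i in range(n - 2, -1, -1):
        bwd[i] = np.max(theta[i+1][None, :] + bwd[i+1][None, :] + psi[i], axis=1)
    # max-marginals; argmax per node recovers the MAP assignment on a tree
    beliefs = [theta[i] + fwd[i] + bwd[i] for i in range(n)]
    return [int(np.argmax(b)) for b in beliefs]

def brute_force(theta, psi):
    def score(x):
        return sum(t[xi] for t, xi in zip(theta, x)) + psi[0][x[0], x[1]] + psi[1][x[1], x[2]]
    states = [(a, b, c) for a in range(K) for b in range(K) for c in range(K)]
    return list(max(states, key=score))
```

On loopy graphs the same message updates are run iteratively without the exactness guarantee, which is the regime the paper targets.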
Max-product type algorithms are fairly effective in practice, achieving moderate accuracy on various problems (Weiss and Freeman, 2001; Felzenszwalb and Huttenlocher, 2006; Globerson and Jaakkola, 2008).
2.2 Related Works
Various graph neural network models have been proposed for graph structured data. These include methods based on the graph Laplacian (Bruna et al., 2013; Defferrard et al., 2016; Kipf and Welling, 2016), using gated networks (Li et al., 2015), and using various other neural network structures for updating the information (Duvenaud et al., 2015; Battaglia et al., 2016; Kearnes et al., 2016; Schütt et al., 2017). In (Gilmer et al., 2017), it was shown that these methods can be viewed as performing message passing on pairwise graphs and can be generalized to the Message Passing Neural Network (MPNN) architecture. In this work, we seek to go beyond pairwise interactions by using message passing on factor graphs.
The PointNet (Qi et al., 2017) provides permutation invariant functions on a set of points instead of a graph. It propagates information from all nodes to a global feature vector, and allows new node features to be generated by appending the global feature to each node. The main benefit of PointNet is its ability to easily capture global information. However, due to a lack of local information exchange, it may lose the ability to represent local details.
Several works have applied graph neural networks to point cloud data, including the EdgeConv method (Wang et al., 2019) and the Point Convolutional Neural Network (PointCNN) (Li et al., 2018). We compare our work with these methods in the experiments. Besides applications to molecular problems (Battaglia et al., 2016; Duvenaud et al., 2015; Gilmer et al., 2017; Kearnes et al., 2016; Li et al., 2015; Schütt et al., 2017), graph neural networks have also been applied to many other problem domains such as combinatorial optimization (Khalil et al., 2017), point cloud processing (Li et al., 2018; Wang et al., 2019) and binary code similarity detection (Xu et al., 2017).
3 Factor Graph Neural Network
Previous works on graph neural networks focus on learning pairwise information exchanges. The Message Passing Neural Network (MPNN) (Gilmer et al., 2017) provides a framework for deriving different graph neural network algorithms by modifying the message passing operations. We aim at enabling the network to efficiently encode higher order features and to propagate information between higher order factors and the nodes by performing message passing on a factor graph. We describe the FGNN network and show that for specific settings of the network parameters we obtain the Max-Product Belief Propagation algorithm. Finally, we show that for certain special factor graph structures, FGNN can be transformed into a pairwise graph neural network, allowing simpler implementation.
3.1 Factor Graph Neural Network
First we give a brief introduction to the Message Passing Neural Network (MPNN), and then we propose an MPNN architecture that can be easily extended to a factor graph version.
Given a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is a set of nodes and $\mathcal{E}$ is the adjacency list, assume that each node $i$ is associated with a feature vector $\mathbf{h}_i$ and each edge $(i, j) \in \mathcal{E}$ with $i, j \in \mathcal{V}$ is associated with an edge feature vector $\mathbf{e}_{ij}$. Then a message passing neural network layer is defined in (Gilmer et al., 2017) as

$$\tilde{\mathbf{h}}_i = \mathcal{U}\Big(\mathbf{h}_i,\ \sum_{j : (i, j) \in \mathcal{E}} \mathcal{M}(\mathbf{h}_i, \mathbf{h}_j, \mathbf{e}_{ij})\Big) \qquad (4)$$

where $\mathcal{M}$ and $\mathcal{U}$ are usually parameterized by neural networks. The summation in (4) can be replaced with other aggregation functions such as maximization (Wang et al., 2019). The main reason to replace summation is that a sum may be corrupted by a single outlier, while the maximization operation is more robust. Thus in our paper we also choose maximization as the aggregation function.
There are also multiple choices for the architectures of $\mathcal{M}$ and $\mathcal{U}$. In our paper, we propose an MPNN architecture as follows:

$$\tilde{\mathbf{h}}_i = \max_{j : (i, j) \in \mathcal{E}} \mathcal{Q}(\mathbf{e}_{ij} \,|\, \Phi)\, \mathcal{M}(\mathbf{h}_i, \mathbf{h}_j \,|\, \Theta), \qquad (5)$$

where $\mathcal{M}(\cdot \,|\, \Theta)$ maps the node feature vectors to a length-$m$ feature vector, and $\mathcal{Q}(\cdot \,|\, \Phi)$ maps the edge feature $\mathbf{e}_{ij}$ to an $n \times m$ weight matrix. Then, by matrix multiplication and aggregation, a new node feature of length $n$ can be generated.
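One possible reading of this layer is sketched below in PyTorch: an edge network produces an $(n \times m)$ weight matrix per edge, a node-pair network produces a length-$m$ message, and the matrix-vector products are max-aggregated at each node. The class name, sub-network shapes, and the explicit Python loops are illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MPNNLayer(nn.Module):
    """Illustrative sketch of Eq. (5): M gives an m-dim message per edge,
    Q maps the edge feature to an (n x m) weight matrix, and the resulting
    n-dim outputs are max-aggregated over each node's neighbours."""
    def __init__(self, d_node, d_edge, m, n):
        super().__init__()
        self.m, self.n = m, n
        self.M = nn.Sequential(nn.Linear(2 * d_node, m), nn.ReLU())
        self.Q = nn.Linear(d_edge, n * m)

    def forward(self, h, edges, e):
        # h: (V, d_node); edges: list of (i, j) pairs; e: (E, d_edge)
        src = torch.stack([h[i] for i, _ in edges])
        dst = torch.stack([h[j] for _, j in edges])
        msg = self.M(torch.cat([src, dst], dim=-1))           # (E, m)
        W = self.Q(e).view(-1, self.n, self.m)                # (E, n, m)
        out_e = torch.einsum('enm,em->en', W, msg)            # (E, n)
        out = h.new_full((h.size(0), self.n), float('-inf'))
        for k, (i, _) in enumerate(edges):                    # max-aggregate at node i
            out[i] = torch.maximum(out[i], out_e[k])
        return out
```

A production implementation would replace the Python loops with scatter-style operations, but the dataflow is the same.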
The MPNN encodes unary and pairwise edge features, but higher order features are not directly encoded. Thus we extend the MPNN by introducing extra factor nodes. Given a factor graph $\mathcal{G} = (\mathcal{V}, \mathcal{C}, \mathcal{E})$, a group of unary features $[\mathbf{h}_i]_{i \in \mathcal{V}}$ and a group of factor features $[\mathbf{g}_c]_{c \in \mathcal{C}}$, assume that for each edge $(c, i) \in \mathcal{E}$, with $c \in \mathcal{C}$ and $i \in \mathcal{V}$, there is an associated edge feature vector $\mathbf{e}_{ci}$. Then the Factor Graph Neural Network layer on $\mathcal{G}$ can be extended from (5) as shown in Figure 3.
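To make the two-module structure concrete, here is a hedged sketch of one FGNN layer built from two passes of the Eq. (5) pattern: a Variable-to-Factor (VF) pass updates each factor's feature from the variables in its scope, and a Factor-to-Variable (FV) pass updates each variable's feature from its incident factors. The class layout, equal feature widths, and Python loops are our illustrative assumptions; the paper's exact module details are in Figure 3 and the supplementary material.

```python
import torch
import torch.nn as nn

class FGNNLayer(nn.Module):
    """Sketch of one FGNN layer: a VF module then an FV module, each
    following the max-aggregation pattern of Eq. (5). Sizes are illustrative."""
    def __init__(self, d, d_edge):
        super().__init__()
        self.d = d
        self.vf_M = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())
        self.vf_Q = nn.Linear(d_edge, d * d)
        self.fv_M = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())
        self.fv_Q = nn.Linear(d_edge, d * d)

    def _pass(self, M, Q, src_feat, dst_feat, pairs, e, n_dst):
        msg = M(torch.cat([src_feat, dst_feat], dim=-1))      # (E, d)
        W = Q(e).view(-1, self.d, self.d)                     # (E, d, d)
        m = torch.einsum('eij,ej->ei', W, msg)                # (E, d)
        out = msg.new_full((n_dst, self.d), float('-inf'))
        for k, (_, dst) in enumerate(pairs):                  # max-aggregate at dst
            out[dst] = torch.maximum(out[dst], m[k])
        return out

    def forward(self, h, g, scopes, e):
        # h: (V, d) variable features; g: (C, d) factor features
        # scopes: list of (variable i, factor c) edges; e: (E, d_edge)
        src_v = h[[i for i, _ in scopes]]
        dst_f = g[[c for _, c in scopes]]
        g_new = self._pass(self.vf_M, self.vf_Q, src_v, dst_f,
                           [(i, c) for i, c in scopes], e, g.size(0))
        src_f = g_new[[c for _, c in scopes]]
        dst_v = h[[i for i, _ in scopes]]
        h_new = self._pass(self.fv_M, self.fv_Q, src_f, dst_v,
                           [(c, i) for i, c in scopes], e, h.size(0))
        return h_new, g_new
```

Stacking several such layers lets information flow repeatedly between variables and higher order factors.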
3.2 FGNN for Max-Product Belief Propagation
MAP inference over a PGM is NP-hard in general, and thus it is often approximately solved by the Max-Product Belief Propagation method. In this section, we will prove that Max-Product Belief Propagation can be exactly parameterized by the FGNN. The sketch of the proof is as follows. First we show that arbitrary higher order potentials can be decomposed as a maximization over a set of rank-1 tensors, and that the decomposition can be represented by a FGNN layer. After the decomposition, a single Max-Product Belief Propagation iteration only requires two operations: (1) maximization over rows or columns of a matrix, and (2) summation over a group of features. We show that the two operations can be exactly parameterized by the FGNN, and that $k$ Max-Product iterations can thus be simulated using $k$ FGNN layers plus a linear layer at the end.
In the worst case, the size of a potential function grows exponentially with the number of variables that it depends on. In such cases, the size of the FGNN produced by our construction will correspondingly grow exponentially. However, if the potential functions can be well approximated as the maximum of a moderate number of rank-1 tensors, the corresponding FGNN will also be of moderate size. In practice, the potential functions may be unknown and only features of the factor nodes are provided; FGNN can learn the approximation from data, potentially exploiting regularities such as low rank approximations if they exist.
Tensor Decomposition
For discrete variables $x_1, \dots, x_d$, a rank-1 tensor is a product of univariate functions of the variables, $\prod_{i=1}^{d} \phi_i(x_i)$. A tensor can always be decomposed as a finite sum of rank-1 tensors (Kolda and Bader, 2009). This has been used to represent potential functions, e.g. in (Wrigley et al., 2017), in conjunction with sum-product type inference algorithms. For max-product type algorithms, a decomposition as the maximum of a finite number of rank-1 tensors is more appropriate. It has been shown in (Kohli and Kumar, 2010) (as stated next) that there is always a finite decomposition of this type.
Lemma 1 (Kohli and Kumar, 2010).
Given an arbitrary potential function $\phi_c(\mathbf{x}_c)$, there exists an auxiliary variable $z$ which takes a finite number of values and a set of univariate potentials $\{\varphi^{i}_{c,z}\}_{i \in s(c)}$ associated with each value of $z$, s.t.

$$\theta_c(\mathbf{x}_c) = \max_{z} \sum_{i \in s(c)} \varphi^{i}_{c,z}(x_i). \qquad (6)$$
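The lemma can be checked numerically with the trivial (exponentially large) decomposition in which $z$ ranges over all joint assignments: each rank-1 term reproduces the potential's value on its own assignment and is heavily penalized elsewhere. The penalty constant `M` and the uniform split of `theta[z]` across variables are our illustrative choices.

```python
import numpy as np

np.random.seed(2)
K, d = 3, 2                                  # 3 states, 2 variables in the scope
theta = np.random.randn(K, K)                # arbitrary tabular log-potential theta_c
M = 2.0 * np.abs(theta).max() + 1.0          # penalty large enough to never win

# z ranges over all K^d joint assignments (inefficient but always valid).
zs = [(a, b) for a in range(K) for b in range(K)]

def phi(i, z, xi):
    """Univariate potential phi_{c,z}^i(x_i) of the trivial decomposition."""
    return theta[z] / d if xi == z[i] else -M

def reconstructed(x):
    """Right-hand side of Eq. (6): max over z of the sum of univariate terms."""
    return max(sum(phi(i, z, x[i]) for i in range(d)) for z in zs)
```

Only when $z$ equals the queried assignment do all terms avoid the penalty, so the maximum recovers `theta` exactly; compact decompositions exploit structure in the potential instead of enumerating assignments.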
Using ideas from (Kohli and Kumar, 2010), we first show that a PGM with tabular potential functions can be converted into a single-layer FGNN, with the non-unary potential functions represented as the maximum of a finite number of rank-1 tensors.
Proposition 2.
A factor graph with variable log-potentials $\theta_i(x_i)$ and factor log-potentials $\theta_c(\mathbf{x}_c)$ can be converted into a factor graph with the same variable potentials and the corresponding decomposed factor log-potentials $\varphi^{i}_{c,z}(x_i)$ using a one-layer FGNN.
The proofs of Proposition 2 and the following two propositions can be found in the supplementary material. With the decomposed higher order potentials, one iteration of the Max-Product algorithm (3) can be rewritten using the following two equations:
$$m_{c \to i}(x_i) = \max_{z} \Big[ \varphi^{i}_{c,z}(x_i) + \sum_{j \in s(c) \setminus \{i\}} \max_{x_j} \big( \varphi^{j}_{c,z}(x_j) + m_{j \to c}(x_j) \big) \Big] \qquad (7a)$$

$$m_{i \to c}(x_i) = \theta_i(x_i) + \sum_{c' \ni i,\ c' \neq c} m_{c' \to i}(x_i) \qquad (7b)$$
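The value of the rewriting in (7a) is that the joint maximization over the scope collapses into per-variable maximizations. The sketch below checks this on a hypothetical factor over three binary variables with a random rank-1 decomposition (sizes and values illustrative): the decomposed form must agree with a brute-force evaluation of the original factor-to-variable message.

```python
import numpy as np

np.random.seed(3)
K, R = 2, 4                             # binary variables, R rank-1 terms
# Hypothetical factor c over variables {0, 1, 2}, with decomposed
# log-potential theta_c(x) = max_r sum_j phi[r, j, x_j]  (Eq. 6).
phi = np.random.randn(R, 3, K)
m_to_c = np.random.randn(3, K)          # incoming messages m_{j->c}(x_j)

def msg_c_to_i_direct(i):
    """Brute force: max over x_c with x_i fixed of theta_c + sum_{j != i} m_{j->c}."""
    out = np.full(K, -np.inf)
    for x in np.ndindex(K, K, K):
        theta_c = max(sum(phi[r, j, x[j]] for j in range(3)) for r in range(R))
        s = theta_c + sum(m_to_c[j, x[j]] for j in range(3) if j != i)
        out[x[i]] = max(out[x[i]], s)
    return out

def msg_c_to_i_decomposed(i):
    """Eq. (7a): the same message using only per-variable maximizations."""
    best = np.full(K, -np.inf)
    for r in range(R):
        inner = sum(np.max(phi[r, j] + m_to_c[j]) for j in range(3) if j != i)
        best = np.maximum(best, phi[r, i] + inner)
    return best
```

The decomposed form costs $O(RK)$ per variable instead of $O(K^{|s(c)|})$, which is what makes the FGNN parameterization tractable.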
Given the log-potentials represented as a set of rank-1 tensors at each factor node, we show that each iteration of the Max-Product message passing update can be represented by a Variable-to-Factor (VF) layer and a Factor-to-Variable (FV) layer, forming a FGNN layer, followed by a linear layer (that can be absorbed into the VF layer of the next iteration).
With the decomposed log-potentials, belief propagation only requires two operations: (1) maximization over rows or columns of a matrix; (2) summation over a group of features. We first show that the maximization operation in (7a) (producing max-marginals) can be done using neural networks that can be implemented by the $\mathcal{M}$ units in the VF layer.
Proposition 3.
For an arbitrary real-valued feature matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$ with $x_{ij}$ as its entry in the $i$-th row and $j$-th column, the feature mapping operation $\mathbf{X} \mapsto [\max_j x_{1j}, \dots, \max_j x_{mj}]^{\mathsf T}$ can be exactly parameterized with a 2-layer neural network with ReLU as the activation function and at most $O(mn)$ hidden units.
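The basic identity behind such constructions is $\max(a, b) = b + \mathrm{relu}(a - b)$, i.e. a pairwise maximum is one ReLU unit plus a linear skip. The sketch below folds this identity over the columns to compute row-wise maxima with only linear maps and ReLU; note this folded construction has depth growing with the number of columns and is purely illustrative, whereas the proposition asserts a tighter two-layer construction (proof in the supplementary material).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def pairwise_max_via_relu(a, b):
    """max(a, b) = b + relu(a - b): one ReLU unit plus a linear skip."""
    return b + relu(a - b)

def row_max_via_relu(X):
    """Row-wise maximum of X using only linear operations and ReLU,
    by folding the pairwise identity over the columns (illustrative)."""
    out = X[:, 0]
    for j in range(1, X.shape[1]):
        out = pairwise_max_via_relu(X[:, j], out)
    return out
```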
Following the maximization operations, equation (7a) requires summation over a group of features. However, the VF layer uses max instead of sum operators to aggregate the features produced by the $\mathcal{M}$ and $\mathcal{Q}$ operators. Assuming that the $\mathcal{M}$ operator has performed the maximization component of equation (7a), producing max-marginals, Proposition 4 shows how the $\mathcal{Q}$ layer can be used to produce a matrix that converts the max-marginals into an intermediate form to be used with the max aggregators. The output of the max aggregators can then be transformed with a linear layer (the matrix $\mathbf{C}$ in Proposition 4) to complete the computation of the summation operation required in equation (7a). Hence, equation (7a) can be implemented by the VF layer together with a linear layer that can be absorbed into the $\mathcal{M}$ operator of the following FV layer.
Proposition 4.
For an arbitrary feature matrix $\mathbf{X} \in \mathbb{R}_{\geq 0}^{m \times n}$ with non-negative entries $x_{ij}$ in the $i$-th row and $j$-th column, there exists a constant tensor $\mathbf{B}$ that can be used to transform $\mathbf{X}$ into an intermediate representation $\mathbf{Y}$, such that after the maximization operations $\hat{y}_k = \max_i y_{ik}$ are done to obtain $\hat{\mathbf{y}}$, we can use another constant matrix $\mathbf{C}$ to obtain

$$\mathbf{C} \hat{\mathbf{y}} = \Big[ \sum_{i=1}^{m} x_{i1},\ \dots,\ \sum_{i=1}^{m} x_{in} \Big]^{\mathsf T}. \qquad (8)$$
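The trick can be seen directly in a small example, as we read it: each row of the non-negative matrix is scattered into its own block of channels, so the max-aggregator (maximizing against zeros) preserves every entry, after which a constant block-summing matrix recovers the column sums. The sizes and the particular scatter layout below are illustrative.

```python
import numpy as np

np.random.seed(4)
m, n = 3, 4
X = np.abs(np.random.randn(m, n))       # non-negative features, as in Proposition 4

# Intermediate representation: row i is scattered into its own block of an
# (m x m*n) matrix, so element-wise max over rows keeps every entry of X.
Y = np.zeros((m, m * n))
for i in range(m):
    Y[i, i * n:(i + 1) * n] = X[i]

y_max = Y.max(axis=0)                   # what the max-aggregator computes

# A constant matrix C = [I I ... I] then sums the blocks, recovering the
# column sums of X from the max-aggregated vector.
C = np.tile(np.eye(n), (1, m))          # shape (n, m*n)
col_sums = C @ y_max
```

Non-negativity matters: the scattered zeros must never exceed the true entries under the max, which is why the proposition is stated for non-negative features.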
Equation (7b) can be implemented in the same way as equation (7a) by the FV layer. First the max operations are done by the $\mathcal{M}$ units to obtain max-marginals. The max-marginals are then transformed into an intermediate form using the $\mathcal{Q}$ units, which is further transformed by the max aggregators. An additional linear layer is then sufficient to complete the summation operation required in equation (7b). This final linear layer can be absorbed into the next FGNN layer, or added as an extra linear layer in the network in the case of the final Max-Product iteration.
We have demonstrated that we can use FGNN layers with an additional linear layer to parameterize iterations of Max-Product Belief Propagation. Combined with Proposition 2, which shows that we can use one layer of FGNN to generate the tensor decomposition, we have the following corollary.
Corollary 5.
Max-Product Belief Propagation in (3) can be exactly parameterized by the FGNN.
Transformation into Graph Neural Network
For graph structures where there exists a perfect matching in the Factor-Variable bipartite graph, i.e. there exists an invertible function $\sigma : \mathcal{C} \to \mathcal{V}$ s.t. $\sigma(c) \in s(c)$ for every factor $c$, a FGNN layer can be implemented as an MPNN layer by stacking the variable feature and the factor feature as follows (more details and derivations are provided in the supplementary file):

$$[\tilde{\mathbf{g}}_c, \tilde{\mathbf{h}}_{\sigma(c)}] = \max_{c' : (c, c') \in \hat{\mathcal{E}}} \mathcal{Q}(\mathbf{e}_{cc'} \,|\, \Phi)\, \mathcal{M}\big([\mathbf{g}_c, \mathbf{h}_{\sigma(c)}], [\mathbf{g}_{c'}, \mathbf{h}_{\sigma(c')}] \,|\, \Theta\big), \qquad (9)$$

where the neighborhood relation is defined as $\hat{\mathcal{E}} = \{(c, c') : \sigma(c') \in s(c) \text{ or } \sigma(c) \in s(c')\}$. The transformation may connect unrelated factors and nodes together (i.e. a pair $(c, i)$ with $i \notin s(c)$). We can add additional gating functions to remove the irrelevant connections, or view the additional connections as providing useful extra approximation capability for the network. Various graph structures satisfy the above constraint in practice. Figure 4 shows an example of the factor graph transformation. Another example is the graph structure for point cloud segmentation proposed by Wang et al. (2019), where for each point a factor is defined to include the point and its nearest neighbours.
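A hypothetical helper (names and representation ours) makes the fused graph concrete: each factor $c$ and its matched variable $\sigma(c)$ become one node of an ordinary graph, and two fused nodes are adjacent when either one's matched variable lies in the other's scope.

```python
def fuse_factor_graph(scopes, sigma):
    """Sketch of the matching-based transformation: returns the directed edge
    set of the fused pairwise graph. scopes maps factor -> set of variable
    indices; sigma maps factor -> its matched variable (sigma(c) in scopes[c])."""
    factors = sorted(scopes)
    edges = set()
    for c in factors:
        for c2 in factors:
            # neighbours if either fused node's variable is in the other's scope
            if c != c2 and (sigma[c2] in scopes[c] or sigma[c] in scopes[c2]):
                edges.add((c, c2))
    return edges
```

Any pairwise MPNN machinery can then run on the fused graph, which is the practical appeal of this special case.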
4 Experiments
In this section, we evaluate the models constructed using FGNN for two types of tasks: MAP inference over higher order PGMs, and point cloud segmentation.
4.1 MAP Inference over PGMs
Data
We construct three synthetic datasets for this experiment in the following manner. We start with a chain structure of length 30 where all nodes take binary states. The node potentials are all randomly generated from a uniform distribution. We use pairwise potentials that encourage two adjacent nodes to take a designated state, i.e. the potential function gives a high value to that configuration and low values to all others. In the first dataset, the pairwise potentials are fixed, while in the other two datasets they are randomly generated. We then add a budget higher order potential (Martins et al., 2015) at every node; each such potential allows at most $k$ of the 8 variables within its scope to take the designated state. For the first two datasets, the value of $k$ is fixed, and in the third dataset it is set randomly. Parameters that are not fixed are provided as input factor features; a complete description of the datasets is included in the supplementary material. Once a data item is generated, we use a branch-and-bound solver (Martins et al., 2015) to find the exact MAP solution, and use it as the label to train our model.
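As a hedged sketch of what such a budget factor contributes to the objective (the exact potential values used in our datasets are in the supplementary material; the penalty magnitude below is an illustrative placeholder), the factor's log-potential simply rules out assignments that activate too many variables in its scope:

```python
def budget_log_potential(x_scope, k, penalty=1e9):
    """Illustrative budget higher-order factor (cf. Martins et al., 2015):
    assignments with more than k active (state-1) variables in the scope
    receive a very low log-potential, all others are unpenalized."""
    return 0.0 if sum(x_scope) <= k else -penalty
```

Such at-most-$k$ constraints couple 8 variables at once, which is precisely the kind of dependency a pairwise graph neural network cannot encode directly.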
We test the ability of the proposed model to find the MAP solutions, and compare the results with other graph neural network based methods, including PointNet (Qi et al., 2017) and EdgeConv (Wang et al., 2019), as well as several dedicated MAP inference solvers, including AD3 (Martins et al., 2015), which solves a linear programming relaxation of the problem, and Max-Product Belief Propagation (Weiss and Freeman, 2001) as implemented by (Mooij, 2010). Both AD3 and Max-Product are approximate inference algorithms and are run with the correct model for each instance.
Architecture and training details
We use a multi-layer factor graph neural network with the architecture FGNN(64) - Res[FC(64) - FGNN(64) - FC(64)] - MLP(128) - Res[FC(64) - FGNN(64) - FC(128)] - FC(256) - Res[FC(64) - FGNN(64) - FC(256)] - FC(128) - Res[FC(64) - FGNN(64) - FC(128)] - FC(64) - Res[FC(64) - FGNN(64) - FC(64)] - FGNN(2). Here FGNN($k$) is a FGNN layer with $k$ output feature dimensions and ReLU (Nair and Hinton, 2010) activation, FC($k$) is a fully connected layer with $k$ output feature dimensions and ReLU activation, and Res[$\cdot$] is a sub-network with a residual link from its input to its output (He et al., 2016).
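The Res[...] notation, as we read it, is a plain residual wrapper in the style of He et al. (2016); a minimal PyTorch sketch (the inner block here is a stand-in, since the wrapped sub-networks above contain FGNN layers):

```python
import torch
import torch.nn as nn

class Res(nn.Module):
    """Res[...] in the architecture string: adds a residual link from the
    input of the wrapped sub-network to its output. The wrapped block must
    preserve the feature dimension for the addition to type-check."""
    def __init__(self, inner: nn.Module):
        super().__init__()
        self.inner = inner

    def forward(self, x):
        return x + self.inner(x)

# e.g. a stand-in for Res[FC(64) - ... - FC(64)] on 64-d features:
block = Res(nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64)))
```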
The model is implemented in PyTorch (Paszke et al., 2017) and trained with the Adam optimizer; the initial learning rate is decreased by a constant factor after each epoch (code is at https://github.com/zzhang1987/FactorGraphNeuralNetwork). PointNet and EdgeConv are trained using their recommended hyperparameters (for point cloud segmentation problems). All the models listed in Table 1 are trained until convergence.
Results
We compare the prediction of each method with the exact MAP solution. The percentage agreement (mean and standard deviation) is provided in Table 1. Our model achieves far better results on both Dataset1 and Dataset2 than all other methods. The performance on Dataset3 is also comparable to that of the LP relaxation.
There is no comparison with PointNet (Qi et al., 2017) and DGCNN (Wang et al., 2019) on Dataset2 and Dataset3 because it is generally difficult for them to handle the edge features associated with the random pairwise and higher order potentials. Also, due to convergence problems of the Max-Product solver on Dataset3, that experiment was not carried out.
Max-Product performs poorly on this problem, so in this case, even though it is possible for FGNN to emulate the Max-Product algorithm, it is better to learn a different inference algorithm.
Table 1: Percentage agreement with the exact MAP solution (mean ± standard deviation).

|          | PointNet     | EdgeConv     | LP Relaxation | Max-Product  | Ours         |
| Dataset1 | 42.6 ± 0.007 | 60.2 ± 0.017 | 80.7 ± 0.025  | 53.0 ± 0.101 | 92.5 ± 0.020 |
| Dataset2 | –            | –            | 83.8 ± 0.024  | 54.2 ± 0.164 | 89.1 ± 0.017 |
| Dataset3 | –            | –            | 88.2 ± 0.011  | –            | 87.7 ± 0.013 |
We have used FGNN factor nodes that depend on 8 neighbouring variable nodes to take advantage of the known dependencies. We also did a small ablation study on the size of the higher order potential functions on Dataset1: the resulting accuracies are 81.7 and 89.9 when 4 and 6 variables are used instead. This shows that knowing the correct size of the potential function gives an advantage.
4.2 Point Cloud Segmentation
Data
We use the Stanford Large-Scale 3D Indoor Spaces Dataset (S3DIS) (Armeni et al., 2016) for semantic segmentation and the ShapeNet part dataset (Yi et al., 2016) for part segmentation.
The S3DIS dataset includes 3D scan point clouds for 6 indoor areas comprising 272 rooms labeled with 13 semantic categories. We follow the same setting as Qi et al. (2017) and Wang et al. (2019), where each room is split into blocks of size 1 meter × 1 meter, and each point is represented by a 9D vector (absolute spatial coordinates, RGB color, and normalized spatial coordinates). For each block, 4096 points are sampled during the training phase, and all points are used in the testing phase. In the experiments we use the same 6-fold cross validation setting as Qi et al. (2017); Wang et al. (2019).
The ShapeNet part dataset contains 16,811 3D shapes from 16 categories, annotated with 50 parts in total. 2048 points are sampled from each shape, and most shapes are annotated with fewer than six parts. For part segmentation, we follow the official train/validation/test split provided by Chang et al. (2015).
Architecture and training details
For both semantic segmentation and part segmentation, for each point in the point cloud we define a factor that includes the point and its $k$ nearest neighbours, and use the feature of that point as the factor feature; the edge features are computed from the point and factor features. A nice property of this factor construction is that it is easy to find a perfect matching in the Factor-Variable bipartite graph, so the FGNN can be efficiently implemented as an MPNN.
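A minimal sketch of this factor construction (brute-force distance computation for clarity; at scale a KD-tree or GPU kNN would be used, and the helper name is ours):

```python
import numpy as np

def knn_factors(points, k):
    """For each point, build a factor whose scope is the point itself plus
    its k nearest neighbours. points: (N, dim) array; returns N index arrays."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)              # exclude self from the neighbours
    neigh = np.argsort(d2, axis=1)[:, :k]
    return [np.concatenate(([i], neigh[i])) for i in range(len(points))]
```

Matching factor $i$ to point $i$ immediately gives the perfect matching mentioned above, since point $i$ is always in factor $i$'s scope.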
Then a multi-layer FGNN with the architecture Input - FGNN(64) - FC(128) - Res[FC(64) - FGNN(64) - FC(128)] - FC(256) - Res[FC(64) - FGNN(64) - FC(256)] - FC(512) - Res[FC(64) - FGNN(64) - FC(512)] - FC(512) - GlobalPooling - FC(512) - Res[FC(64) - FGNN(64) - FC(512)] - FC(256) - DR(0.5) - FC(256) - DR(0.5) - FC(128) - FC($c$) is used, where FGNN, FC and Res are the same as in the previous section, DR($p$) is a dropout layer with dropout rate $p$, the global pooling layer (GlobalPooling) is the same as the global pooling in PointNet, and $c$ is the number of classes.
Our model is trained with the Adam optimizer; the initial learning rate is decreased by a constant factor after each epoch. For semantic segmentation we train the model for 100 epochs, and for part segmentation we train for 200 epochs with batch size 8, on a single NVIDIA RTX 2080Ti card. In the experiments, we strictly follow the same protocol as Qi et al. (2017); Wang et al. (2019) for fair comparison. More details on the experiments are provided in the supplementary files.
Results
For both tasks, we use Intersection-over-Union (IoU) on points to evaluate the performance of the different models, strictly following the evaluation scheme of Qi et al. (2017); Wang et al. (2019). The quantitative results for semantic segmentation are shown in Table 2 and those for part segmentation in Table 3. On semantic segmentation our algorithm attains the best performance, while on part segmentation it attains performance comparable with the other algorithms. These results demonstrate the utility of FGNN on real tasks.
Table 2: Semantic segmentation results on the S3DIS dataset (6-fold cross validation).

|                  | PointNet (baseline) | PointNet | DGCNN | PointCNN | Ours |
| Mean IoU         | 20.1                | 47.6     | 56.1  | 57.3     | 60.0 |
| Overall Accuracy | 53.2                | 81.1     | 84.1  | 84.5     | 85.5 |
Table 3: Part segmentation results on the ShapeNet part dataset (IoU, %).

| Method    | Mean | Aero | Bag  | Cap  | Car  | Chair | Earphone | Guitar | Knife | Lamp | Laptop | Motor | Mug  | Pistol | Rocket | Skateboard | Table |
| PointNet  | 83.7 | 83.4 | 78.7 | 82.5 | 74.9 | 89.6  | 73.0     | 91.5   | 85.9  | 80.8 | 95.3   | 65.2  | 93.0 | 81.2   | 57.9   | 72.8       | 80.6  |
| DGCNN     | 85.1 | 84.2 | 83.7 | 84.4 | 77.1 | 90.9  | 78.5     | 91.5   | 87.3  | 82.9 | 96.0   | 67.8  | 93.3 | 82.6   | 59.7   | 75.5       | 82.0  |
| PointCNN* | 86.1 | 84.1 | 86.5 | 86.0 | 80.8 | 90.6  | 79.7     | 92.3   | 88.4  | 85.3 | 96.1   | 77.2  | 95.3 | 84.2   | 64.2   | 80.0       | 83.0  |
| Ours      | 84.7 | 84.7 | 84.0 | 86.1 | 78.2 | 90.8  | 70.4     | 90.8   | 88.7  | 82.4 | 95.5   | 70.6  | 94.7 | 81.0   | 56.8   | 75.3       | 80.5  |

*The PointCNN model is trained with the same scheme except for data augmentation: during the training of PointCNN, the coordinate data are augmented by adding zero-mean Gaussian noise.
5 Conclusion
We extend graph neural networks to factor graph neural networks, enabling the network to capture higher order dependencies among the variables. The factor graph neural network can represent the execution of the Max-Product Belief Propagation algorithm on probabilistic graphical models, allowing it to do well when Max-Product does well; at the same time, it has the potential to learn better inference algorithms from data when Max-Product fails. Experiments on a synthetic dataset and two real datasets show that the method gives promising performance.
The ability to capture arbitrary dependencies opens up new opportunities for adding structural bias into learning and inference problems. The relationship to graphical model inference through the Max-Product algorithm provides a guide on how knowledge of dependencies can be added into factor graph neural networks.
References
 Armeni et al. [2016] Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3d semantic parsing of largescale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1534–1543, 2016.
 Battaglia et al. [2016] Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. In Advances in neural information processing systems, pages 4502–4510, 2016.
 Bruna et al. [2013] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
 Chang et al. [2015] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An informationrich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
 Defferrard et al. [2016] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pages 3844–3852, 2016.
 Duvenaud et al. [2015] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán AspuruGuzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232, 2015.
 Felzenszwalb and Huttenlocher [2006] Pedro F Felzenszwalb and Daniel P Huttenlocher. Efficient belief propagation for early vision. International journal of computer vision, 70(1):41–54, 2006.
 Gilmer et al. [2017] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1263–1272. JMLR.org, 2017.
 Globerson and Jaakkola [2008] Amir Globerson and Tommi S Jaakkola. Fixing maxproduct: Convergent message passing algorithms for map lprelaxations. In Advances in neural information processing systems, pages 553–560, 2008.
 He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 Kearnes et al. [2016] Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of computeraided molecular design, 30(8):595–608, 2016.
 Khalil et al. [2017] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pages 6348–6358, 2017.
 Kipf and Welling [2016] Thomas N Kipf and Max Welling. Semisupervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
 Kohli and Kumar [2010] Pushmeet Kohli and M Pawan Kumar. Energy minimization for linear envelope mrfs. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1863–1870. IEEE, 2010.
 Kolda and Bader [2009] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009.
 Koller and Friedman [2009] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
 Li et al. [2018] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. PointCNN: Convolution on Transformed Points. In Advances in Neural Information Processing Systems, pages 820–830, 2018.
 Li et al. [2015] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
 Martins et al. [2015] André FT Martins, Mário AT Figueiredo, Pedro MQ Aguiar, Noah A Smith, and Eric P Xing. AD3: Alternating directions dual decomposition for map inference in graphical models. The Journal of Machine Learning Research, 16(1):495–545, 2015.
 Mooij [2010] Joris M Mooij. libdai: A free and open source c++ library for discrete approximate inference in graphical models. Journal of Machine Learning Research, 11(Aug):2169–2173, 2010.
 Nair and Hinton [2010] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML10), pages 807–814, 2010.
 Paszke et al. [2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
 Qi et al. [2017] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
 Schütt et al. [2017] Kristof T Schütt, Farhad Arbabzadah, Stefan Chmiela, Klaus R Müller, and Alexandre Tkatchenko. Quantum-chemical insights from deep tensor neural networks. Nature Communications, 8:13890, 2017.
 Wang et al. [2019] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic Graph CNN for Learning on Point Clouds. ACM Transactions on Graphics (TOG), 2019.
 Weiss and Freeman [2001] Yair Weiss and William T Freeman. On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs. IEEE Transactions on Information Theory, 47(2):736–744, 2001.
 Wrigley et al. [2017] Andrew Wrigley, Wee Sun Lee, and Nan Ye. Tensor belief propagation. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3771–3779, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/wrigley17a.html.
 Xu et al. [2018] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
 Xu et al. [2017] Xiaojun Xu, Chang Liu, Qian Feng, Heng Yin, Le Song, and Dawn Song. Neural network-based graph embedding for cross-platform binary code similarity detection. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 363–376. ACM, 2017.
 Yi et al. [2016] Li Yi, Vladimir G Kim, Duygu Ceylan, I Shen, Mengyan Yan, Hao Su, Cewu Lu, Qixing Huang, Alla Sheffer, Leonidas Guibas, et al. A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (TOG), 35(6):210, 2016.
Factor Graph Neural Network: Supplementary File
Appendix A Proof of propositions
Lemma 6.
Given nonnegative feature vectors , where , there exist matrices with shape and a vector , s.t.
Proof.
Let
(10) 
then we have that
Since all the feature vectors are nonnegative, we immediately have that . ∎
Lemma 6 suggests that, for a group of feature vectors, we can use the operator to produce several matrices that map different vectors into different subspaces of a high-dimensional space, so that the maximization aggregation can gather information from the feature groups without loss.
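The construction behind Lemma 6 can be sketched numerically. In the toy snippet below (names, dimensions, and the block-selection matrices are our own illustration, not the paper's notation), each of k nonnegative vectors is mapped into its own block of a kd-dimensional space; because the remaining blocks stay at zero and the features are nonnegative, an element-wise max aggregation recovers every vector simultaneously:

```python
import numpy as np

# k nonnegative feature vectors of dimension d (illustrative sizes)
k, d = 3, 4
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(k, d))  # nonnegative features

# M[i] embeds vector i into the i-th block of a (k*d)-dim space;
# all other blocks remain 0, which nonnegative entries dominate under max.
M = np.zeros((k, k * d, d))
for i in range(k):
    M[i, i * d:(i + 1) * d, :] = np.eye(d)

mapped = np.stack([M[i] @ X[i] for i in range(k)])  # shape (k, k*d)
aggregated = mapped.max(axis=0)                     # max aggregation

# The aggregate is exactly the concatenation of all k vectors.
assert np.allclose(aggregated, X.reshape(-1))
```

Under these assumptions no information is lost: the aggregated vector is a faithful encoding of the whole feature group.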
Proposition 2.
A factor graph with variable log-potentials and factor log-potentials can be converted into a factor graph with the same variable potentials and the corresponding decomposed factor log-potentials using a one-layer FGNN.
Proof.
Without loss of generality, we assume that . Then let
(11) 
where can be an arbitrary real number larger than . Then we have
(12) 
Assume that we have a factor , and that each node can take states. Then can be sorted as
and the higher order potential can be organized as a vector . Then for each , the term in (11) has entries, and each entry is either a scaled entry of the vector or an arbitrary negative number less than .
Thus if we organize as a length- vector , we can define a matrix , where if and only if the entry of is set to the entry of multiplied by , the entry of in row , column is set to ; all other entries of are set to some negative number smaller than . Due to the assumption that , the matrix multiplication must produce a legal .
If we directly define a network which produces the above matrices , then in the aggregating part of our network there might be information loss. However, by Lemma 6 there must exist a group of such that the maximization aggregation over features produces exactly a vector representation of . Thus if every is a different one-hot vector, we can easily use a single linear layer to produce all , and with a network which always outputs the factor feature, we are able to output a vector representation of at each factor node . ∎
Given the log-potentials represented as a set of rank-1 tensors at each factor node, we need to show that each iteration of the Max-Product message passing update can be represented by a Variable-to-Factor layer followed by a Factor-to-Variable layer (forming one FGNN layer). We reproduce the update equations here.
(13a)  
(13b) 
In the max-product updating procedure, we need to keep all decomposed and all unary potentials . This requires the FGNN to be able to fit the identity mapping. If the net always outputs the identity matrix, the always outputs , and always outputs , then the FGNN is an identity mapping. As always outputs a matrix and outputs a vector, we can use part of their blocks as an identity mapping to keep and . The other blocks are used to update and .
First we show that the operators in the Variable-to-Factor layer can be used to construct the computational graph for the max-marginal operations.
Proposition 3.
For an arbitrary real-valued feature matrix with as its entry in the row and column, the feature mapping operation can be exactly parameterized by a 2-layer neural network with ReLU as the activation function and at most hidden units.
Proof.
Without loss of generality we assume that , and we use to denote . When , it is obvious that
and the maximization can be parameterized by a two-layer neural network with 3 hidden units, which satisfies the proposition.
Assume that the proposition holds when . Then for , we can find and using two networks with layers and at most hidden units each. Stacking the two networks together results in a network with layers and at most hidden units. We can then add another two-layer network with 3 hidden units to compute . Thus by mathematical induction the proposition is proved. ∎
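The base case of this induction can be made concrete. Using the identity max(a, b) = ReLU(a - b) + ReLU(b) - ReLU(-b), which holds because ReLU(b) - ReLU(-b) = b, a two-layer ReLU network with 3 hidden units computes the maximum of two numbers exactly. The weights below are our own illustrative instantiation:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Two-layer ReLU network with 3 hidden units computing max(a, b):
# max(a, b) = relu(a - b) + relu(b) - relu(-b)
W1 = np.array([[1.0, -1.0],   # hidden unit 1: a - b
               [0.0,  1.0],   # hidden unit 2: b
               [0.0, -1.0]])  # hidden unit 3: -b
w2 = np.array([1.0, 1.0, -1.0])

def net_max(a, b):
    return w2 @ relu(W1 @ np.array([a, b]))

for a, b in [(2.0, 5.0), (-3.0, -7.0), (0.0, 0.0), (1.5, -2.5)]:
    assert np.isclose(net_max(a, b), max(a, b))
```

Stacking such pairwise-max gadgets, as in the inductive step, then yields the maximum over arbitrarily many inputs.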
The update equations contain summations over columns of a matrix after the max-marginal operations. However, the VF and FV layers use max operators to aggregate the features produced by the and operators. Assuming that the operator has produced the max-marginals, we use the to produce several weight matrices. The max-marginals are multiplied by the weight matrices to produce new feature vectors, and the maximization aggregation function is then used to gather information from the new feature vectors. We use the following proposition to show that the summation of max-marginals can be implemented by one MPNN layer plus one linear layer. Thus we can use the VF layer plus a linear layer to produce , and the FV layer plus another linear layer to produce . Hence, to do iterations of Max-Product, we need FGNN layers followed by a linear layer.
Proposition 4.
For an arbitrary nonnegative valued feature matrix with as its entry in the row and column, there exists a constant tensor that can be used to transform into an intermediate representation , such that after maximization operations are done to obtain , we can use another constant matrix to obtain
(14) 
Proof.
In Lemma 6 and Proposition 4, only nonnegative features are considered, while log-potentials may have negative entries. However, for the MAP inference problem in (2), the following transformation makes the log-potentials nonnegative without changing the final MAP assignment:
(15) 
As a result, for an arbitrary PGM we can first apply the above transformation to make the log-potentials nonnegative, and then our FGNN can exactly perform Max-Product Belief Propagation on the transformed nonnegative log-potentials.
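A minimal numerical sketch of this shift (using, purely as an illustration, a single pairwise log-potential table; the variable names are ours): adding a constant to a log-potential changes every assignment's score by the same amount, so the maximizing assignment is preserved while all entries become nonnegative.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(size=(2, 2))      # a pairwise log-potential, possibly negative

theta_shifted = theta - theta.min()  # constant shift: now all entries >= 0

assert (theta_shifted >= 0).all()
# The maximizing configuration is unchanged by the constant shift.
assert np.unravel_index(theta.argmax(), theta.shape) == \
       np.unravel_index(theta_shifted.argmax(), theta_shifted.shape)
```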
Transformation to Graph Neural Network
The factor graph is a bipartite graph between factors and nodes. If there is a perfect matching between factors and nodes, we can use it to transform the FGNN into an MPNN. Assume that we have a parameterized FGNN as follows:
(16) 
When the perfect matching exists, there must exist an invertible function which maps a node to a factor . Then for each , we can pack the features together to get a supernode .
Next we construct the edges between the supernodes. In the FGNN (16), a node exchanges information with all such that . Thus the supernode has to communicate with supernodes such that . On the other hand, the factor communicates with all such that , and thus the supernode has to communicate with supernodes such that . Given these constraints, the neighbors of a supernode are defined as
(17) 
As is a one-to-one matching function, the supernode can be uniquely determined by , and thus we can use to represent .
The edge list may create links between unrelated nodes and factors (i.e., node and factor pairs whose scopes do not intersect). Thus for each and we create a tag which equals 1 if and 0 otherwise. Assuming without loss of generality that the and nets produce positive features, with the tag we are able to define an equivalent MPNN of (16) as follows:
Furthermore, we can put the tag into the edge feature and let the neural network learn to reject unrelated clusters and nodes, so that the above MPNN becomes (9).
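The supernode construction can be sketched as follows. The toy factor scopes, the matching `sigma`, and the exact tag convention below are our own illustrative assumptions (the paper's symbols are elided here): each supernode packs a node with its matched factor, supernodes are linked whenever either member of one pair appears in the other pair's factor scope, and the tag records whether a given node actually belongs to a given matched factor.

```python
from itertools import product

# Hypothetical toy factor graph: factors with their variable scopes,
# and an invertible perfect matching sigma from nodes to factors.
scopes = {"c1": {1, 2}, "c2": {2, 3}, "c3": {1, 3}}
sigma = {1: "c1", 2: "c2", 3: "c3"}

# Supernode u_i packs node i with factor sigma(i). Supernodes i and j
# are neighbors if node i is in sigma(j)'s scope or node j is in
# sigma(i)'s scope.
def neighbors(i):
    return {j for j in sigma if j != i
            and (i in scopes[sigma[j]] or j in scopes[sigma[i]])}

# Tag for the ordered pair (i, j): 1 if node j appears in factor
# sigma(i)'s scope, 0 otherwise (marks related vs. unrelated pairs).
tags = {(i, j): int(j in scopes[sigma[i]]) for i, j in product(sigma, sigma)}
```

In an MPNN built on this graph, the tag (as part of the edge feature) lets the network suppress messages along edges that connect unrelated node and factor pairs.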
Appendix B Additional Information on MAP Inference over PGM
We construct three datasets, in which all variables are binary. Each instance starts with a chain structure with a unary potential on every node and pairwise potentials between consecutive nodes. A higher order potential is then added at every node.
The node potentials are all randomly generated from the uniform distribution over . We use pairwise potentials that encourage two adjacent nodes to take state , i.e. the potential function gives a high value to the configuration and a low value to all other configurations. The detailed settings for the pairwise potentials can be found in Table 4 and Table 5. For example, in Dataset1, the potential value for taking state 0 and taking state 1 is 0.2; in Dataset2 and Dataset3, the potential value for and both taking state 1 is sampled from a uniform distribution over [0, 2].
[Table 4: pairwise potential values for Dataset1]

Table 5: Pairwise potential values for Dataset2 and Dataset3.
x_i \ x_{i+1}  0  1
0  0  0
1  0  U[0,2]
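As a sketch of how one instance's chain potentials could be generated (the chain length, variable names, and exact table layout are our assumptions based on the description above, with the Dataset2/3-style pairwise table that rewards only the (1, 1) configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8  # hypothetical chain length

# Node potentials: one U[0, 1] value per state of each binary variable.
node_pot = rng.uniform(0, 1, size=(n, 2))

# Pairwise potentials on consecutive nodes: only the (1, 1)
# configuration gets a reward, drawn from U[0, 2]; all others are 0.
pair_pot = np.zeros((n - 1, 2, 2))
pair_pot[:, 1, 1] = rng.uniform(0, 2, size=n - 1)
```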
We then add the budget higher order potential (Martins et al., 2015) at every node; these potentials allow at most of the 8 variables within their scope to take state 1. For the first two datasets, the value is set to 5; for the third dataset, it is set to a random integer in {1, 2, 3, 4, 5, 6, 7, 8}.
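A hedged sketch of such a budget potential as a function of an assignment (the function name and the large finite penalty standing in for minus infinity are our own choices):

```python
import numpy as np

def budget_potential(assignment, budget, penalty=-1e3):
    """Budget higher-order potential in the style of Martins et al. (2015):
    value 0 if at most `budget` variables in scope take state 1,
    otherwise a large negative penalty (a finite stand-in for -inf)."""
    return 0.0 if int(np.sum(assignment)) <= budget else penalty

assert budget_potential([1, 1, 0, 0, 1, 0, 0, 0], budget=5) == 0.0
assert budget_potential([1] * 8, budget=5) == -1e3
```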
As a result of these constructions, different datasets have different inputs for the FGNN; for each dataset, the inputs for each instance are the parameters of the PGM that are not fixed. For Dataset1, only the node potentials are not fixed; hence each input instance is a factor graph with the randomly generated node potential added as the input node feature for each variable node. For Dataset2, randomly generated node potentials are used as variable node features, while randomly generated pairwise potential parameters are used as the corresponding pairwise factor node features. Finally, for Dataset3, the variable nodes, the pairwise factor nodes and the higher order factor nodes all have corresponding input features.
Appendix C Extra Information On Point Cloud Segmentation
In the point cloud segmentation experiments, various factors may affect the final performance. One of the most critical is data sampling. For both the ShapeNet dataset and the S3DIS dataset, a point cloud must be sampled from either a CAD model or an indoor scene block. Thus, for fair comparison, all methods are trained on the datasets sampled from the original ShapeNet and S3DIS datasets by Qi et al. (2017), and following Qi et al. (2017) and Wang et al. (2019), we do not apply any data augmentation during training. The ShapeNet dataset has an official train/val/test split. We train on the training set, validate after each training epoch, and then use the model with the best validation performance for evaluation on the test set. For the S3DIS dataset, as we perform 6-fold cross-validation, we simply run 100 epochs for each fold and evaluate the model from the last epoch. The detailed comparison of the IoU for each class is given in Table 6, where for PointNet and EdgeConv we directly use the results from Qi et al. (2017) and Wang et al. (2019), since we follow exactly their experimental protocol. Wang et al. (2019) did not provide the detailed per-class IoU in their paper. For PointCNN (Li et al., 2018), we re-run the experiment with exactly the same protocol as the others.
Method  OA  mIoU  ceiling  floor  wall  beam  column  window  door  table  chair  sofa  bookcase  board  clutter 
PointNet  78.5  47.6  88.0  88.7  69.3  42.4  23.1  47.5  51.6  54.1  42.0  9.6  38.2  29.4  35.2 
EdgeConv  84.4  56.1  –  –  –  –  –  –  –  –  –  –  –  –  – 
PointCNN  84.5  57.3  92.0  93.2  76.0  46.1  23.6  43.8  56.2  67.5  64.5  30.0  52.1  49.0  50.8 
Ours  85.5  60.0  93.0  95.3  78.3  59.8  38.3  55.4  61.2  64.5  57.7  26.7  50.0  49.1  50.3 