DFNets: Spectral CNNs for Graphs with Feedback-Looped Filters


Asiri  Wijesinghe
Research School of Computer Science
The Australian National University
asiri.wijesinghe@anu.edu.au
Qing Wang
Research School of Computer Science
The Australian National University
qing.wang@anu.edu.au
Abstract

We propose a novel spectral convolutional neural network (CNN) model on graph structured data, namely Distributed Feedback-Looped Networks (DFNets). This model incorporates a robust class of spectral graph filters, called feedback-looped filters, to provide better localization on vertices, while still attaining fast convergence and linear memory requirements. Theoretically, feedback-looped filters can guarantee convergence w.r.t. a specified error bound, and can be applied universally to any graph without knowing its structure. Furthermore, the propagation rule of this model can diversify features from the preceding layers to produce strong gradient flows. We have evaluated our model using two benchmark tasks: semi-supervised document classification on citation networks and semi-supervised entity classification on a knowledge graph. The experimental results show that our model considerably outperforms the state-of-the-art methods in both benchmark tasks over all datasets.

1 Introduction

Convolutional neural networks (CNNs) Krizhevsky et al. (2012) are a powerful deep learning approach which has been widely applied in various fields, e.g., object recognition Sharif Razavian et al. (2014), image classification Hu et al. (2015), and semantic segmentation Li et al. (2017). Traditionally, CNNs only deal with data that has a regular Euclidean structure, such as images, videos and text. In recent years, due to the rising trends in network analysis and prediction, generalizing CNNs to graphs has attracted considerable interest Bruna et al. (2013); Defferrard et al. (2016); Hamilton et al. (2017); Perozzi et al. (2014). However, since graphs are in irregular non-Euclidean domains, this brings up the challenge of how to enhance CNNs for effectively extracting useful features (e.g. topological structure) from arbitrary graphs.

To address this challenge, a number of studies have been devoted to enhancing CNNs by developing filters over graphs. In general, there are two categories of graph filters: (a) spatial graph filters, and (b) spectral graph filters. Spatial graph filters are defined as convolutions directly on graphs, which consider neighbors that are spatially close to a current vertex Atwood and Towsley (2016); Duvenaud et al. (2015); Hamilton et al. (2017). In contrast, spectral graph filters are convolutions indirectly defined on graphs, through their spectral representations Bruna et al. (2013); Chung and Graham (1997); Defferrard et al. (2016). In this paper, we follow the line of previous studies in developing spectral graph filters and tackle the problem of designing an effective, yet efficient, CNN with spectral graph filters.

Previously, Bruna et al. Bruna et al. (2013) proposed convolution operations on graphs via a spectral decomposition of the graph Laplacian. To reduce learning complexity in the setting where the graph structure is not known a priori, Henaff et al. Henaff et al. (2015) developed a spectral filter with smooth coefficients. Then, Defferrard et al. Defferrard et al. (2016) introduced Chebyshev filters to stabilize convolution operations under coefficient perturbation; these filters can be exactly localized in the k-hop neighborhood. Later, Kipf and Welling Kipf and Welling (2017) proposed a simple layer-wise propagation model using Chebyshev filters on the 1-hop neighborhood. Very recently, some works attempted to develop rational polynomial filters, such as Cayley filters Levie et al. (2017) and ARMA filters Bianchi et al. (2019). From a different perspective, Veličković et al. Veličković et al. (2017) proposed a self-attention based CNN architecture for graph filters, which extracts features by considering the importance of neighbors.

Figure 1: A simplified example illustrating feedback-looped filters. The similarity of the colours indicates how strongly each vertex is correlated with the current vertex (some vertices are highly correlated with it, others less so): (a) an input graph, annotated with the original frequency of each vertex; (b) the feedforward filtering, which attenuates some low-order frequencies and amplifies others; (c) the feedback filtering, which reduces the error in the frequencies generated by (b).

One key idea behind existing works on designing spectral graph filters is to approximate the frequency responses of graph filters using a polynomial function (e.g. Chebyshev filters Defferrard et al. (2016)) or a rational polynomial function (e.g. Cayley filters Levie et al. (2017) and ARMA Bianchi et al. (2019)). Polynomial filters are sensitive to changes in the underlying graph structure. They are also very smooth and can hardly model sharp changes, as illustrated in Figure 1. Rational polynomial filters are more powerful to model localization, but they often have to trade off computational efficiency, resulting in higher learning and computational complexities, as well as instability.

Contributions. In this work, we aim to develop a new class of spectral graph filters that can overcome the above limitations. We also propose a spectral CNN architecture (i.e. DFNet) to incorporate these graph filters. In summary, our contributions are as follows:

  • Improved localization. A new class of spectral graph filters, called feedback-looped filters, is proposed to enable better localization, due to its rational polynomial form. Basically, feedback-looped filters consist of two parts: feedforward and feedback. The feedforward filtering is k-localized, as in polynomial filters, while the feedback filtering is unique in that it refines the k-localized features captured by the feedforward filtering to improve approximation accuracy. We also propose two techniques, scaled-normalization and cut-off frequency, to avoid the issues of gradient vanishing/exploding and instabilities.

  • Efficient computation. For feedback-looped filters, we avoid the matrix inversion implied by the denominator by approximating it with a recursion. Benefiting from this approximation, feedback-looped filters attain linear convergence time and linear memory requirements w.r.t. the number of edges in a graph.

  • Theoretical properties. Feedback-looped filters enjoy several nice theoretical properties. Unlike other rational polynomial filters for graphs, they have theoretically guaranteed convergence w.r.t. a specified error bound. On the other hand, they still have the universal design property of other spectral graph filters Isufi et al. (2017b), i.e., they can be applied without knowing the underlying structure of a graph. The optimal coefficients of feedback-looped filters are learnable via an optimization condition for any given graph.

  • Dense architecture. We propose a layer-wise propagation rule for our spectral CNN model with feedback-looped filters, which densely connects layers as in DenseNet Huang et al. (2017). This design enables our model to diversify features from all preceding layers, leading to a strong gradient flow. We also introduce a layer-wise regularization term to alleviate the overfitting issue. In doing so, we can prevent the generation of spurious features and thus improve accuracy of the prediction.

To empirically verify the effectiveness of our work, we have evaluated feedback-looped filters within three different CNN architectures over four benchmark datasets against the state-of-the-art methods. The experimental results show that our models significantly outperform the state-of-the-art methods. We further demonstrate the effectiveness of our model DFNet by visualizing the learned node embeddings of two datasets in a 2-D space.

2 Spectral Convolution on Graphs

Let $G = (V, E, A)$ be an undirected and weighted graph, where $V$ is a set of vertices, $E$ is a set of edges, and $A \in \mathbb{R}^{n \times n}$ is an adjacency matrix which encodes the weights of edges. We let $n = |V|$ and $m = |E|$. A graph signal is a function $x: V \to \mathbb{R}$ and can be represented as a vector $x \in \mathbb{R}^{n}$ whose $i$-th component $x_i$ is the value of $x$ at the $i$-th vertex in $V$. The graph Laplacian is defined as $L = I_n - D^{-1/2} A D^{-1/2}$, where $D$ is a diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$ and $I_n$ is an identity matrix. $L$ has a set of orthogonal eigenvectors $\{u_i\}_{i=1}^{n}$, known as the graph Fourier basis, and non-negative eigenvalues $\{\lambda_i\}_{i=1}^{n}$, known as the graph frequencies Chung and Graham (1997). $L$ is diagonalizable by the eigendecomposition such that $L = U \Lambda U^{H}$, where $\Lambda = \mathrm{diag}([\lambda_1, \dots, \lambda_n])$ and $U^{H}$ is the Hermitian transpose of $U$. We use $\lambda_{min}$ and $\lambda_{max}$ to denote the smallest and largest eigenvalues of $L$, respectively.

Given a graph signal $x \in \mathbb{R}^{n}$, the graph Fourier transform of $x$ is $\hat{x} = U^{H} x$ and its inverse is $x = U \hat{x}$ Sandryhaila and Moura (2013); Shuman et al. (2013). The graph Fourier transform enables us to apply graph filters in the vertex domain. A graph filter $g_{\theta}$ can filter $x$ by altering (amplifying or attenuating) the graph frequencies as

$$g_{\theta}(L)\, x \;=\; g_{\theta}(U \Lambda U^{H})\, x \;=\; U\, g_{\theta}(\Lambda)\, U^{H} x. \qquad (1)$$

Here, $g_{\theta}(\Lambda) = \mathrm{diag}([g_{\theta}(\lambda_1), \dots, g_{\theta}(\lambda_n)])$, which controls how the frequency of each component in a graph signal is modified. However, applying graph filtering as in Eq. 1 requires the eigendecomposition of $L$, which is computationally expensive. To address this issue, several works Bianchi et al. (2019); Defferrard et al. (2016); Hammond et al. (2011); Kipf and Welling (2017); Levie et al. (2017); Liao et al. (2019) have studied the approximation of $g_{\theta}$ by a polynomial or rational polynomial function.
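
To make the filtering operation in Eq. 1 concrete, the following minimal NumPy sketch builds the normalized Laplacian of a small toy graph, computes its eigendecomposition, and filters a signal with an example frequency response. The adjacency matrix and the hand-picked response $1/(1+\lambda)$ are illustrative assumptions only; in practice the response is learned.

```python
import numpy as np

# Toy undirected graph: a symmetric adjacency matrix with non-negative weights.
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
n = A.shape[0]

# Normalized graph Laplacian L = I - D^{-1/2} A D^{-1/2}.
d = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt

# Eigendecomposition L = U diag(lam) U^T; U is the graph Fourier basis.
lam, U = np.linalg.eigh(L)

# A graph signal and an example (not learned) low-pass frequency response.
x = np.array([1.0, 0.0, 2.0, -1.0])
g_theta = 1.0 / (1.0 + lam)

# Eq. 1: transform to the spectral domain, rescale each frequency, transform back.
x_hat = U.T @ x                  # graph Fourier transform of x
y = U @ (g_theta * x_hat)        # filtered signal in the vertex domain
print(y)
```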

Chebyshev filters. Hammond et al. Hammond et al. (2011) first proposed to approximate $g_{\theta}$ by a polynomial function with $k$-order polynomials and Chebyshev coefficients. Later, Defferrard et al. Defferrard et al. (2016) developed Chebyshev filters for spectral CNNs on graphs. A Chebyshev filter is defined as

$$g_{\theta}(L) \;=\; \sum_{j=0}^{k} \theta_j\, T_j(\tilde{L}), \qquad (2)$$

where $\theta$ is a vector of learnable Chebyshev coefficients, $\tilde{L} = 2L/\lambda_{max} - I_n$ is rescaled from $L$, the Chebyshev polynomials are recursively defined as $T_j(\tilde{L}) = 2\tilde{L}\, T_{j-1}(\tilde{L}) - T_{j-2}(\tilde{L})$ with $T_0(\tilde{L}) = I_n$ and $T_1(\tilde{L}) = \tilde{L}$, and $k$ controls the size of filters, i.e., localized in the k-hop neighborhood of a vertex Hammond et al. (2011). Kipf and Welling Kipf and Welling (2017) simplified Chebyshev filters by restricting them to the 1-hop neighborhood.
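
The Chebyshev recursion can be applied directly to a signal without any eigendecomposition. The sketch below is a hedged illustration of Eq. 2; the function name and arguments are our own, not taken from any particular library.

```python
import numpy as np

def chebyshev_filter(L, x, theta, lam_max):
    """Apply a Chebyshev spectral filter of order k = len(theta) - 1 (Eq. 2).

    Uses the recursion T_0(L~) x = x, T_1(L~) x = L~ x,
    T_j(L~) x = 2 L~ T_{j-1}(L~) x - T_{j-2}(L~) x, so no eigendecomposition is needed.
    """
    n = L.shape[0]
    L_tilde = (2.0 / lam_max) * L - np.eye(n)   # rescale eigenvalues into [-1, 1]
    t_prev, t_curr = x, L_tilde @ x             # T_0 x and T_1 x
    y = theta[0] * t_prev
    if len(theta) > 1:
        y = y + theta[1] * t_curr
    for j in range(2, len(theta)):
        t_next = 2.0 * (L_tilde @ t_curr) - t_prev
        y = y + theta[j] * t_next
        t_prev, t_curr = t_curr, t_next
    return y
```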

Lanczos filters. Recently, Liao et al. Liao et al. (2019) used the Lanczos algorithm to generate a low-rank matrix approximation for the graph Laplacian. They used the affinity matrix $S = D^{-1/2} A D^{-1/2}$. Since $L = I_n - S$ holds, $L$ and $S$ share the same eigenvectors but have different eigenvalues; the eigenvalue $\lambda_i$ of $L$ and $1 - \lambda_i$ of $S$ correspond to the same eigenvector. To approximate the eigenvectors and eigenvalues of $S$, they diagonalize the tri-diagonal matrix produced by the Lanczos iteration to compute Ritz-vectors $V$ and Ritz-values $R$, and thus $S \approx V R V^{T}$. Accordingly, a k-hop Lanczos filter operation is

$$g_{\theta}(S)\, x \;=\; \sum_{j=0}^{k-1} \theta_j\, V R^{j} V^{T} x, \qquad (3)$$

where $\theta$ is a vector of learnable Lanczos filter coefficients. Thus, the spectral convolutional operation is defined as $\bar{y} = g_{\theta}(S)\, x$. Such Lanczos filter operations can significantly reduce the computation overhead when approximating large powers of $S$. Thus, they can efficiently compute the spectral graph convolution with a very large localization range to easily capture the multi-scale information of the graph.
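
As a rough illustration of this construction (not the authors' LanczosNet implementation), the following sketch runs a plain k-step Lanczos iteration on a symmetric matrix to obtain Ritz values and vectors, and then applies a low-rank filter of the form sketched in Eq. 3. Numerical refinements such as full re-orthogonalization are omitted.

```python
import numpy as np

def lanczos_low_rank(S, k, seed=0):
    """k-step Lanczos iteration on a symmetric matrix S (a minimal sketch).

    Returns Ritz values R and Ritz vectors V such that S ~ V @ diag(R) @ V.T.
    """
    n = S.shape[0]
    rng = np.random.default_rng(seed)
    Q = np.zeros((n, k))
    alpha = np.zeros(k)
    beta = np.zeros(k)
    q = rng.standard_normal(n)
    Q[:, 0] = q / np.linalg.norm(q)
    for j in range(k):
        w = S @ Q[:, j]
        alpha[j] = Q[:, j] @ w
        w -= alpha[j] * Q[:, j]
        if j > 0:
            w -= beta[j - 1] * Q[:, j - 1]
        if j + 1 < k:
            beta[j] = np.linalg.norm(w)
            if beta[j] < 1e-10:               # Krylov subspace exhausted early
                Q, alpha, beta = Q[:, :j + 1], alpha[:j + 1], beta[:j]
                break
            Q[:, j + 1] = w / beta[j]
    # Diagonalize the tri-diagonal matrix T to obtain Ritz pairs.
    off = beta[:len(alpha) - 1]
    T = np.diag(alpha) + np.diag(off, 1) + np.diag(off, -1)
    R, B = np.linalg.eigh(T)                  # Ritz values and eigenvectors of T
    V = Q @ B                                 # Ritz vectors
    return R, V

def lanczos_filter(V, R, x, theta):
    """Low-rank filter y = sum_j theta_j * V diag(R)^j V^T x (sketch of Eq. 3)."""
    z = V.T @ x
    return sum(t * (V @ ((R ** j) * z)) for j, t in enumerate(theta))
```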

Cayley filters. Observing that Chebyshev filters have difficulty in detecting narrow frequency bands due to the rescaling of eigenvalues into $[-1, 1]$, Levie et al. Levie et al. (2017) proposed Cayley filters, based on Cayley polynomials:

$$g_{c,h}(\lambda) \;=\; c_0 + 2\,\mathrm{Re}\Big\{ \sum_{j=1}^{r} c_j\, (h\lambda - i)^{j} (h\lambda + i)^{-j} \Big\}, \qquad (4)$$

where $c_0$ is a real coefficient and $c = [c_1, \dots, c_r]$ is a vector of complex coefficients. $\mathrm{Re}(z)$ denotes the real part of a complex number $z$, and $h > 0$ is a parameter called spectral zoom, which controls the degree of "zooming" into eigenvalues in $\Lambda$. Both $c$ and $h$ are learnable during training. To improve efficiency, the Jacobi method is used to approximately compute Cayley polynomials.
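
The next sketch simply evaluates the Cayley frequency response of Eq. 4 on a set of eigenvalues with hypothetical coefficients; the actual CayleyNet avoids the eigendecomposition and applies the filter via Jacobi iterations.

```python
import numpy as np

def cayley_response(lam, c0, c, h):
    """Evaluate the Cayley frequency response of Eq. 4 on eigenvalues lam.

    c0 is a real coefficient, c a list of complex coefficients, h the spectral zoom.
    This is only a sketch of the response curve, not a full CayleyNet layer.
    """
    lam = np.asarray(lam, dtype=float)
    z = (h * lam - 1j) / (h * lam + 1j)       # Cayley transform of the zoomed eigenvalues
    resp = np.full_like(lam, c0, dtype=float)
    zj = np.ones_like(z)
    for cj in c:
        zj = zj * z                           # z^j
        resp += 2.0 * np.real(cj * zj)
    return resp

# Example with made-up coefficients over eigenvalues in [0, 2].
print(cayley_response(np.linspace(0, 2, 5), c0=0.5, c=[0.2 + 0.1j, -0.05j], h=1.0))
```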

ARMA filters. Bianchi et al. Bianchi et al. (2019) sought to address similar issues as identified in Levie et al. (2017). However, different from Cayley filters, they developed a first-order ARMA filter (ARMA$_1$), which is approximated by a first-order recursion:

$$\bar{x}^{(t+1)} \;=\; a M \bar{x}^{(t)} + b x, \qquad (5)$$

where $a$ and $b$ are the filter coefficients, $M = \frac{\lambda_{max} - \lambda_{min}}{2} I_n - L$, and $\bar{x}^{(0)} = x$. Accordingly, the frequency response is defined as:

$$g(\mu) \;=\; \frac{r}{\mu - p}, \qquad (6)$$

where $\mu_i = \frac{\lambda_{max} - \lambda_{min}}{2} - \lambda_i$ are the eigenvalues of $M$, $r = -b/a$, and $p = 1/a$ Isufi et al. (2017b). Multiple ARMA$_1$ filters can be applied in parallel to obtain an ARMA$_K$ filter. However, the memory complexity of $K$ parallel ARMA$_1$ filters is $K$ times higher than that of a single ARMA$_1$ graph filter.
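
A minimal sketch of the ARMA$_1$ recursion in Eq. 5 is given below, assuming the shifted matrix $M$ defined above; the coefficients $a$ and $b$ and the number of iterations are placeholders, and convergence requires the recursion to be a contraction.

```python
import numpy as np

def arma1_filter(L, x, a, b, num_iters=50):
    """First-order ARMA recursion (a sketch of Eq. 5).

    M is the shifted Laplacian (lam_max - lam_min)/2 * I - L; the recursion
    x_bar <- a * M @ x_bar + b * x converges when |a| * ||M|| < 1.
    """
    lam = np.linalg.eigvalsh(L)
    M = (lam.max() - lam.min()) / 2.0 * np.eye(L.shape[0]) - L
    x_bar = x.copy()
    for _ in range(num_iters):
        x_bar = a * (M @ x_bar) + b * x
    return x_bar
```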

We make some remarks on how these existing spectral filters are related to each other. (i) As discussed in Bianchi et al. (2019); Levie et al. (2017); Liao et al. (2019), polynomial filters (e.g. Chebyshev and Lanczos filters) can be approximately treated as a special kind of rational polynomial filters. (ii) Further, Chebyshev filters can be regarded as a special case of Lanczos filters. (iii) Although both Cayley and ARMA filters are rational polynomial filters, they differ in how they approximate the matrix inverse implied by the denominator of a rational function: Cayley filters use a fixed number of Jacobi iterations, while ARMA filters use a first-order recursion plus a parallel bank of ARMA$_1$ filters. (iv) ARMA$_1$ by Bianchi et al. Bianchi et al. (2019) is similar to GCN by Kipf and Welling Kipf and Welling (2017) because they both consider localization within the 1-hop neighborhood.

3 Proposed Method

We introduce a new class of spectral graph filters, called feedback-looped filters, and propose a spectral CNN for graphs with feedback-looped filters, namely Distributed Feedback-Looped Networks (DFNets). We also discuss optimization techniques and analyze theoretical properties.

3.1 Feedback-Looped Filters

Feedback-looped filters belong to the class of Auto Regressive Moving Average (ARMA) filters Isufi et al. (2017a, b). Formally, an ARMA filter is defined as:

$$y \;=\; g_{\psi,\phi}(L)\, x \;=\; \Big(I_n + \sum_{j=1}^{p} \psi_j L^{j}\Big)^{-1} \Big(\sum_{j=0}^{q} \phi_j L^{j}\Big) x. \qquad (7)$$

The parameters $p$ and $q$ refer to the feedback and feedforward degrees, respectively. $\psi \in \mathbb{C}^{p}$ and $\phi \in \mathbb{C}^{q+1}$ are two vectors of complex coefficients. Computing the denominator of Eq. 7 however requires a matrix inversion, which is computationally inefficient for large graphs. To circumvent this issue, feedback-looped filters use the following approximation:

$$\bar{x}^{(t)} \;=\; -\sum_{j=1}^{p} \psi_j \bar{L}^{j}\, \bar{x}^{(t-1)} + \sum_{j=0}^{q} \phi_j \bar{L}^{j}\, x, \qquad (8)$$

where $\bar{x}^{(0)} = x$, $\psi \in \mathbb{C}^{p}$, $\phi \in \mathbb{C}^{q+1}$, $\bar{L}$ is the scaled-normalized Laplacian introduced below, and $\bar{\lambda}_{max}$ is the largest eigenvalue of $\bar{L}$. Accordingly, the frequency response of feedback-looped filters is defined as:

$$g(\bar{\lambda}) \;=\; \frac{\sum_{j=0}^{q} \phi_j \bar{\lambda}^{j}}{1 + \sum_{j=1}^{p} \psi_j \bar{\lambda}^{j}}. \qquad (9)$$
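
The recursion of Eq. 8 can be implemented with sparse matrix-vector products only. The following sketch (our own illustration, with hypothetical argument names) precomputes the feedforward term once, since it depends only on the input signal, and then iterates the feedback term:

```python
import numpy as np

def feedback_looped_filter(L_bar, x, psi, phi, num_iters=20):
    """Feedback-looped filtering via the recursion of Eq. 8 (a minimal sketch).

    L_bar is assumed to be the scaled-normalized Laplacian, psi are the p feedback
    coefficients and phi the q+1 feedforward coefficients.
    """
    # Precompute [x, L_bar x, ..., L_bar^q x] and the feedforward term once.
    q = len(phi) - 1
    Lx = [x]
    for _ in range(q):
        Lx.append(L_bar @ Lx[-1])
    feedforward = sum(phi_j * v for phi_j, v in zip(phi, Lx))

    x_bar = x.copy()
    for _ in range(num_iters):
        # Feedback term: -sum_{j=1}^{p} psi_j L_bar^j x_bar, via repeated products.
        v = x_bar
        feedback = np.zeros_like(x_bar)
        for psi_j in psi:
            v = L_bar @ v
            feedback -= psi_j * v
        x_bar = feedback + feedforward
    return x_bar
```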

To alleviate the issues of gradient vanishing/exploding and numerical instabilities, we further introduce two techniques in the design of feedback-looped filters: scaled-normalization and cut-off frequency.

Scaled-normalization technique. To assure the stability of feedback-looped filters, we apply the scaled-normalization technique to increase the stability region, i.e., using the scaled-normalized Laplacian $\bar{L}$ rather than $L$ itself. This helps centralize the eigenvalues of the Laplacian and reduce its spectral radius bound. The scaled-normalized Laplacian $\bar{L}$ has graph frequencies $\{\bar{\lambda}_i\}_{i=1}^{n}$ lying in a bounded interval, ordered increasingly.

Cut-off frequency technique. To map graph frequencies onto a uniform discrete distribution, we define a cut-off frequency $\lambda_{cut}$ in terms of $\bar{\lambda}_{max}$, the largest eigenvalue of $\bar{L}$. The cut-off frequency is used as a threshold to control the amount of attenuation on graph frequencies. The eigenvalues $\bar{\lambda}_i$ are converted to binary values such that $\bar{\lambda}_i = 1$ if $\bar{\lambda}_i \ge \lambda_{cut}$ and $\bar{\lambda}_i = 0$ otherwise. This trick allows the generation of ideal high-pass filters so as to sharpen a signal by amplifying its graph Fourier coefficients. It also addresses the issue of narrow frequency bands present in previous spectral filters, including both polynomial and rational polynomial filters Defferrard et al. (2016); Levie et al. (2017), because those filters only accept a small band of frequencies. In contrast, our proposed feedback-looped filters resolve this issue using the cut-off frequency technique, i.e., amplifying frequencies above the (low) cut-off value while attenuating frequencies below it. Thus, our proposed filters can accept a wider range of frequencies and better capture the characteristic properties of a graph.
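
A minimal sketch of the thresholding step, assuming a given cut-off value (its precise definition follows from the largest eigenvalue of $\bar{L}$ as described above):

```python
import numpy as np

def desired_response(lam_bar, lam_cut):
    """Binary desired frequency response from the cut-off frequency technique.

    Frequencies at or above lam_cut are kept (mapped to 1); lower frequencies
    are attenuated (mapped to 0).
    """
    lam_bar = np.asarray(lam_bar, dtype=float)
    return (lam_bar >= lam_cut).astype(float)
```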

3.2 Coefficient Optimisation

Given a feedback-looped filter with a desired frequency response $\hat{g}: \bar{\lambda} \mapsto \{0, 1\}$, we aim to find the optimal coefficients $\psi$ and $\phi$ that make the frequency response $g$ as close as possible to the desired frequency response $\hat{g}$, i.e. to minimize the following error:

$$e_1(\psi, \phi) \;=\; \Big\| \hat{g}(\bar{\lambda}) - \frac{\sum_{j=0}^{q} \phi_j \bar{\lambda}^{j}}{1 + \sum_{j=1}^{p} \psi_j \bar{\lambda}^{j}} \Big\|_2. \qquad (10)$$

However, the above equation is not linear w.r.t. the coefficients $\psi$ and $\phi$. Thus, we redefine the error as follows:

$$e_2(\psi, \phi) \;=\; \Big\| \hat{g}(\bar{\lambda}) \Big(1 + \sum_{j=1}^{p} \psi_j \bar{\lambda}^{j}\Big) - \sum_{j=0}^{q} \phi_j \bar{\lambda}^{j} \Big\|_2. \qquad (11)$$

Let $\hat{g} = [\hat{g}(\bar{\lambda}_1), \dots, \hat{g}(\bar{\lambda}_n)]^{T}$, and let $B$ with $B_{ij} = \bar{\lambda}_i^{j}$ and $F$ with $F_{ij} = \bar{\lambda}_i^{j-1}$ be two Vandermonde-like matrices. Then, we have $e_2(\psi, \phi) = \| \mathrm{diag}(\hat{g})\, B \psi + \hat{g} - F \phi \|_2$. Thus, the stable coefficients $\psi$ and $\phi$ can be learned by minimizing $e_2$ as a convex constrained least-squares optimization problem:

$$\min_{\psi, \phi} \;\; \| \mathrm{diag}(\hat{g})\, B \psi + \hat{g} - F \phi \|_2 \quad \text{subject to} \quad \| B \psi \|_{\infty} \le \eta < 1. \qquad (12)$$

Here, the parameter $\eta$ controls the tradeoff between convergence efficiency and approximation accuracy: a higher value of $\eta$ can lead to slower convergence but better accuracy, while very low values are not recommended due to potentially unacceptable accuracy. The constraint $\| B \psi \|_{\infty} \le \eta < 1$ is the stability condition, which will be further discussed in Section 3.4.
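
For illustration, Eq. 12 can be handed to an off-the-shelf convex solver. The sketch below uses CVXPY and follows the reconstruction of $B$, $F$, and the stability constraint given above; the function name and the default $\eta$ are our assumptions, not part of the released implementation.

```python
import numpy as np
import cvxpy as cp

def fit_filter_coefficients(lam_bar, g_hat, p, q, eta=0.85):
    """Solve the constrained least-squares problem of Eq. 12 (a sketch).

    lam_bar: graph frequencies of the scaled-normalized Laplacian.
    g_hat:   desired (binary) frequency response at those frequencies.
    """
    lam_bar = np.asarray(lam_bar, dtype=float)
    g_hat = np.asarray(g_hat, dtype=float)
    # Vandermonde-like matrices: B has columns lam, ..., lam^p; F has 1, lam, ..., lam^q.
    B = np.vander(lam_bar, p + 1, increasing=True)[:, 1:]
    F = np.vander(lam_bar, q + 1, increasing=True)
    psi = cp.Variable(p)
    phi = cp.Variable(q + 1)
    # Residual diag(g_hat) B psi + g_hat - F phi, minimized in the 2-norm.
    residual = cp.multiply(g_hat, B @ psi) + g_hat - F @ phi
    problem = cp.Problem(cp.Minimize(cp.norm(residual, 2)),
                         [cp.norm(B @ psi, "inf") <= eta])   # stability condition
    problem.solve()
    return psi.value, phi.value
```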

3.3 Spectral Convolutional Layer

We propose a CNN-based architecture, called DFNets, which stacks multiple spectral convolutional layers with feedback-looped filters to extract features of increasing abstraction. Let $P^{(p)} = -\sum_{j=1}^{p} \psi_j \bar{L}^{j}$ and $Q^{(q)} = \sum_{j=0}^{q} \phi_j \bar{L}^{j}$. The propagation rule of a spectral convolutional layer is defined as:

$$H^{(t+1)} \;=\; \sigma\big( P^{(p)} H^{(t)} \theta_1 + Q^{(q)} X \theta_2 + b \big), \qquad (13)$$

where $\sigma$ refers to a non-linear activation function such as ReLU. $X$ is a graph signal matrix whose columns are the input features. $H^{(t)}$ is the matrix of activations in the $t$-th layer, and $\theta_1$ and $\theta_2$ are two trainable weight matrices in that layer. To compute $H^{(t+1)}$, a vertex needs access to its $p$-hop neighbors with the output signal $H^{(t)}$ of the previous layer, and its $q$-hop neighbors with the input signal $X$. To attenuate the overfitting issue, we add an $\ell_2$ penalty on the weights, namely kernel regularization Cortes et al. (2009), and a bias term $b$. We use the Xavier normal initialization method Glorot and Bengio (2010) to initialise the kernel and bias weights, the unit-norm constraint technique Douglas et al. (2000) to normalise the kernel and bias weights by restricting the parameters of all layers to a small range, and the kernel regularization technique to penalize the parameters in each layer during training. In doing so, we can prevent the generation of spurious features and thus improve the accuracy of prediction (the DFNets implementation can be found at: https://github.com/wokas36/DFNets).

In this model, each layer is directly connected to all subsequent layers in a feed-forward manner, as in DenseNet Huang et al. (2017). Consequently, each layer receives the feature maps of all preceding layers as input. We concatenate the preceding feature maps column-wise into a single tensor to obtain more diversified features for boosting accuracy. This densely connected CNN architecture has several compelling benefits: (a) it reduces the vanishing-gradient issue, (b) it increases feature propagation and reuse, and (c) it refines the information flow between layers Huang et al. (2017).
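
To make the propagation rule and the dense connectivity concrete, the following NumPy sketch implements a single forward pass. It follows our reading of Eq. 13 (two weight matrices acting on the feedback and feedforward terms plus a bias) and is not the authors' Keras implementation; weight shapes, initialization, regularization, and training are left to the caller.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def dfnet_layer(L_bar, H_prev, X, psi, phi, theta1, theta2, b):
    """Forward pass of one spectral convolutional layer (a sketch of Eq. 13)."""
    # Feedback term: -sum_{j=1}^{p} psi_j L_bar^j H_prev.
    feedback = np.zeros_like(H_prev)
    V = H_prev
    for psi_j in psi:
        V = L_bar @ V
        feedback -= psi_j * V
    # Feedforward term: sum_{j=0}^{q} phi_j L_bar^j X.
    feedforward = np.zeros_like(X)
    W = X
    for phi_j in phi:
        feedforward += phi_j * W
        W = L_bar @ W
    return relu(feedback @ theta1 + feedforward @ theta2 + b)

def dfnet_forward(L_bar, X, layer_params):
    """DenseNet-style stacking: each layer sees all preceding feature maps,
    concatenated column-wise, as its H_prev input. Each entry of layer_params
    is a tuple (psi, phi, theta1, theta2, b) with theta1 shaped to match the
    concatenated width at that depth."""
    feature_maps = [X]
    for params in layer_params:
        H_prev = np.concatenate(feature_maps, axis=1)
        feature_maps.append(dfnet_layer(L_bar, H_prev, X, *params))
    return feature_maps[-1]
```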

3.4 Theoretical Analysis

Feedback-looped filters have several nice properties, e.g., guaranteed convergence, linear convergence time, and universal design. We discuss these properties and analyze computational complexities.

Convergence. Theoretically, a feedback-looped filter achieves a desired frequency response only in the limit of infinitely many iterations of Eq. 8 Isufi et al. (2017b). However, due to the property of linear convergence preserved by feedback-looped filters, stability can be guaranteed after a finite number of iterations w.r.t. a specified small error Isufi et al. (2017a). More specifically, since the poles of rational polynomial filters should lie within the unit circle of the z-plane to guarantee stability, we can derive the stability condition from Eq. 7 in the vertex domain and correspondingly obtain the stability condition in the frequency domain as stipulated in Eq. 12 Isufi et al. (2017a).

Universal design. The universal design is beneficial when the underlying structure of a graph is unknown or the topology of a graph changes over time. The corresponding filter coefficients can be learned independently of the underlying graph and are universally applicable. When designing feedback-looped filters, we define the desired frequency response function over graph frequencies in a binary format in the uniform discrete distribution as discussed in Section 3.1. Then, we solve Eq. 12 in the least-squares sense for this finite set of graph frequencies to find optimal filter coefficients.

Spectral Graph Filter | Type | Learning Complexity | Time Complexity | Memory Complexity
Chebyshev filters Defferrard et al. (2016) | Polynomial | | |
Lanczos filters Liao et al. (2019) | Polynomial | | |
Cayley filters Levie et al. (2017) | Rational polynomial | | |
ARMA$_1$ filters Bianchi et al. (2019) | Rational polynomial | | |
$K$ parallel ARMA$_1$ filters Bianchi et al. (2019) | Rational polynomial | | |
Feedback-looped filters (ours) | Rational polynomial | | |
Table 1: Learning, time and space complexities of spectral graph filters.

Complexity. When computing $\bar{x}^{(t)}$ as in Eq. 8, we need to calculate $\bar{L}^{j} \bar{x}^{(t-1)}$ for $j \in \{1, \dots, p\}$ and $\bar{L}^{j} x$ for $j \in \{0, \dots, q\}$. Nevertheless, the second term is computed only once because $x$ does not change across iterations. Thus, we need $p$ sparse matrix-vector multiplications for the first term in Eq. 8 at each iteration, and $q$ multiplications for the second term in Eq. 8. Table 1 summarizes the complexity results of existing spectral graph filters and ours, where $J$ refers to the number of Jacobi iterations in Levie et al. (2017). Note that, when only one spectral convolutional layer is used, feedback-looped filters have the same learning, time and memory complexities as Chebyshev filters.

4 Numerical Experiments

We evaluate our models on two benchmark tasks: (1) semi-supervised document classification in citation networks, and (2) semi-supervised entity classification in a knowledge graph.

4.1 Experimental Set-Up

Datasets. We use three citation network datasets Cora, Citeseer, and Pubmed Sen et al. (2008) for semi-supervised document classification, and one dataset NELL Carlson et al. (2010) for semi-supervised entity classification. NELL is a bipartite graph extracted from a knowledge graph Carlson et al. (2010). Table 2 contains dataset statistics Yang et al. (2016).

Dataset Type #Nodes #Edges #Classes #Features %Labeled Nodes
Cora Citation network 2,708 5,429 7 1,433 0.052
Citeseer Citation network 3,327 4,732 6 3,703 0.036
Pubmed Citation network 19,717 44,338 3 500 0.003
NELL Knowledge graph 65,755 266,144 210 5,414 0.001
Table 2: Dataset statistics.

Baseline methods. We compare against twelve baseline methods, including five methods using spatial graph filters, i.e., Semi-supervised Embedding (SemiEmb) Weston et al. (2012), Label Propagation (LP) Zhu et al. (2003), skip-gram graph embedding model (DeepWalk) Perozzi et al. (2014), Iterative Classification Algorithm (ICA) Lu and Getoor (2003), and semi-supervised learning with graph embedding (Planetoid*) Yang et al. (2016), and seven methods using spectral graph filters: Chebyshev Defferrard et al. (2016), Graph Convolutional Networks (GCN) Kipf and Welling (2017), Lanczos Networks (LNet) and Adaptive Lanczos Networks (AdaLNet) Liao et al. (2019), CayleyNet Levie et al. (2017), Graph Attention Networks (GAT) Veličković et al. (2017), and ARMA Convolutional Networks (ARMA) Bianchi et al. (2019).

We evaluate our feedback-looped filters using three different spectral CNN models: (i) DFNet: a densely connected spectral CNN with feedback-looped filters, (ii) DFNet-ATT: a self-attention based densely connected spectral CNN with feedback-looped filters, and (iii) DF-ATT: a self-attention based spectral CNN model with feedback-looped filters.

Model L2 reg. #Layers #Units Dropout [p, q]
DFNet 9e-2 5 [8, 16, 32, 64, 128] 0.9 [5, 3] 0.5
DFNet-ATT 9e-4 4 [8, 16, 32, 64] 0.9 [5, 3] 0.5
DF-ATT 9e-3 2 [32, 64] [0.1, 0.9] [5, 3] 0.5
Table 3: Hyperparameter settings for citation network datasets.

Hyperparameter settings. We use the same data splitting for each dataset as in Yang et al. Yang et al. (2016). The hyperparameters of our models are initially selected by applying the orthogonalization technique (a randomized search strategy). We also use layer-wise L2 regularization and bias terms to attenuate the overfitting issue. All models are trained for 200 epochs using the Adam optimizer Kingma and Ba (2015) with a learning rate of 0.002. Table 3 summarizes the hyperparameter settings for the citation network datasets. The same hyperparameters are applied to the NELL dataset, except for L2 regularization (i.e., 9e-2 for DFNet and DFNet-ATT, and 9e-4 for DF-ATT). For the remaining hyperparameters, we choose the best setting for each model. For self-attention, we use 8 multi-attention heads and 0.5 attention dropout for DFNet-ATT, and 6 multi-attention heads and 0.3 attention dropout for DF-ATT. The polynomial orders $p = 5$ and $q = 3$ in Table 3 are applied to all three models over all datasets.

4.2 Comparison with Baseline Methods

Table 4 summarizes the results of classification in terms of accuracy. The results of the baseline methods are taken from the previous works Kipf and Welling (2017); Liao et al. (2019); Veličković et al. (2017); Yang et al. (2016). Our models DFNet and DFNet-ATT outperform all the baseline methods over all four datasets. Particularly, we can see that: (1) Compared with polynomial filters, DFNet improves upon GCN (which performs best among the models using polynomial filters) by a margin of 3.7%, 3.9%, 5.3% and 2.3% on the datasets Cora, Citeseer, Pubmed and NELL, respectively. (2) Compared with rational polynomial filters, DFNet improves upon CayleyNet and ARMA by 3.3% and 0.5% on the Cora dataset, respectively. For the other datasets, CayleyNet does not have results available in Levie et al. (2017). (3) DFNet-ATT further improves the results of DFNet due to the addition of a self-attention layer. (4) Compared with GAT (Chebyshev filters with self-attention), DF-ATT also improves the results and achieves 0.4%, 0.6% and 3.3% higher accuracy on the datasets Cora, Citeseer and Pubmed, respectively.

Additionally, we compare DFNet (our feedback-looped filters + DenseBlock) with GCN + DenseBlock and GAT + DenseBlock. The results are also presented in Table 4. We can see that our feedback-looped filters perform best, no matter whether or not the dense architecture is used.

Model Cora Citeseer Pubmed NELL
SemiEmb Weston et al. (2012) 59.0 59.6 71.1 26.7
LP Zhu et al. (2003) 68.0 45.3 63.0 26.5
DeepWalk Perozzi et al. (2014) 67.2 43.2 65.3 58.1
ICA Lu and Getoor (2003) 75.1 69.1 73.9 23.1
Planetoid* Yang et al. (2016) 64.7 75.7 77.2 61.9
Chebyshev Defferrard et al. (2016) 81.2 69.8 74.4 -
GCN Kipf and Welling (2017) 81.5 70.3 79.0 66.0
LNet Liao et al. (2019) 79.5 66.2 78.3 -
AdaLNet Liao et al. (2019) 80.4 68.7 78.1 -
CayleyNet Levie et al. (2017)   81.9 - - -
ARMA Bianchi et al. (2019) 84.7 73.8 81.4 -
GAT Veličković et al. (2017) 83.0 72.5 79.0 -
GCN + DenseBlock 82.7 ± 0.5 71.3 ± 0.3 81.5 ± 0.5 66.4 ± 0.3
GAT + DenseBlock 83.8 ± 0.3 73.1 ± 0.3 81.8 ± 0.3 -
DFNet (ours) 85.2 ± 0.5 74.2 ± 0.3 84.3 ± 0.4 68.3 ± 0.4
DFNet-ATT (ours) 86.0 ± 0.4 74.7 ± 0.4 85.2 ± 0.3 68.8 ± 0.3
DF-ATT (ours) 83.4 ± 0.5 73.1 ± 0.4 82.3 ± 0.3 67.6 ± 0.3
Table 4: Accuracy (%) averaged over 10 runs (the CayleyNet result was obtained using a different data splitting in Levie et al. (2017)).

4.3 Comparison under Different Polynomial Orders

In order to test how the polynomial orders $p$ and $q$ influence the performance of our model DFNet, we conduct experiments to evaluate DFNet on the three citation network datasets using different polynomial orders. Figure 2 presents the experimental results. In our experiments, $p = 5$ and $q = 3$ turn out to be the best parameters for DFNet over these datasets. In other words, feedback-looped filters are more stable with $p = 5$ and $q = 3$ than with other values of $p$ and $q$. This is because, with $p = 5$ and $q = 3$, Eq. 12 obtains better convergence for finding optimal coefficients than in the other cases. Furthermore, we observe that: (1) setting these orders too low or too high can both lead to poor performance, as shown in Figure 2(a), and (2) when $q$ is larger than $p$, the accuracy decreases rapidly, as shown in Figure 2(b). Thus, when choosing $p$ and $q$, we require that $q \le p$ holds.

Figure 2: Accuracy (%) of DFNet under different polynomial orders $p$ and $q$.

4.4 Evaluation of Scaled-Normalization and Cut-off Frequency

To understand how effectively the scaled-normalization and cut-off frequency techniques help learn graph representations, we compare our methods, which implement both techniques, with variants that implement only one of them. The results are presented in Figure 3. We can see that the models using both techniques outperform the models using only one of them over all citation network datasets. The improvement is particularly significant on the Cora and Citeseer datasets.

Figure 3: Accuracy (%) of our models in three cases: (1) using both scaled-normalization and cut-off frequency, (2) using only cut-off frequency, and (3) using only scaled-normalization.

4.5 Node Embeddings

We analyze the node embeddings learned by DFNets over two datasets, Pubmed and Cora, in a 2-D space. Figures 4 and 5 display the learned 2-D embeddings of GCN, GAT, and DFNet (ours) on the Pubmed and Cora citation networks, respectively, obtained by applying t-SNE Maaten and Hinton (2008). Colors denote the different classes in these datasets and reveal the clustering quality of these models. The figures clearly show that our model DFNet better separates the 3 and 7 classes in the embedding spaces of the Pubmed and Cora datasets, respectively. This is because the features extracted by DFNet yield better node representations than those of the GCN and GAT models.
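
For reference, a minimal sketch of how such a visualization can be produced from any learned node representation matrix (not tied to the authors' code):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_embeddings(H, labels, title):
    """Project learned node representations H (n x d) to 2-D with t-SNE and plot,
    colouring nodes by class label."""
    coords = TSNE(n_components=2, random_state=0).fit_transform(H)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=8)
    plt.title(title)
    plt.show()
```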

Figure 4: The t-SNE visualization of the 2-D node embedding space for the Pubmed dataset: (a) GCN, (b) GAT, (c) DFNet (ours).
Figure 5: The t-SNE visualization of the 2-D node embedding space for the Cora dataset: (a) GCN, (b) GAT, (c) DFNet (ours).

5 Conclusions

In this paper, we have introduced a spectral CNN architecture (DFNets) with feedback-looped filters on graphs. To improve approximation accuracy, we have developed two techniques: scaled normalization and cut-off frequency. In addition, we have discussed some nice properties of feedback-looped filters, such as guaranteed convergence, linear convergence time, and universal design. Our proposed model significantly outperforms the state-of-the-art approaches in two benchmark tasks. In the future, we plan to extend the current work to time-varying graph structures. As discussed in Isufi et al. (2017b), feedback-looped graph filters are practically appealing for time-varying settings, and, similar to static graphs, some nice properties would likely hold for graphs that are a function of time.

References

  • [1] J. Atwood and D. Towsley (2016) Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1993–2001. Cited by: §1.
  • [2] F. M. Bianchi, D. Grattarola, L. Livi, and C. Alippi (2019) Graph neural networks with convolutional ARMA filters. arXiv preprint arXiv:1901.01343. Cited by: §1, §1, §2, §2, §2, Table 1, §4.1, Table 4.
  • [3] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun (2013) Spectral networks and locally connected networks on graphs. International Conference on Learning Representations (ICLR). Cited by: §1, §1, §1.
  • [4] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka, and T. M. Mitchell (2010) Toward an architecture for never-ending language learning. In Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI), Cited by: §4.1.
  • [5] F. R. Chung and F. C. Graham (1997) Spectral graph theory. American Mathematical Soc.. Cited by: §1, §2.
  • [6] C. Cortes, M. Mohri, and A. Rostamizadeh (2009) L2 regularization for learning kernels. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI), pp. 109–116. Cited by: §3.3.
  • [7] M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems (NeurIPS), pp. 3844–3852. Cited by: §1, §1, §1, §1, §2, §2, §3.1, Table 1, §4.1, Table 4, Table 5, Appendices.
  • [8] S. C. Douglas, S. Amari, and S. Kung (2000) On gradient adaptation with unit-norm constraints. IEEE Transactions on Signal processing 48 (6), pp. 1843–1847. Cited by: §3.3.
  • [9] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems (NeurIPS), pp. 2224–2232. Cited by: §1.
  • [10] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics (AIStats), pp. 249–256. Cited by: §3.3.
  • [11] W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1024–1034. Cited by: §1, §1.
  • [12] D. K. Hammond, P. Vandergheynst, and R. Gribonval (2011) Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis 30 (2), pp. 129–150. Cited by: §2, §2.
  • [13] M. Henaff, J. Bruna, and Y. LeCun (2015) Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163. Cited by: §1.
  • [14] W. Hu, Y. Huang, L. Wei, F. Zhang, and H. Li (2015) Deep convolutional neural networks for hyperspectral image classification. Journal of Sensors 2015. Cited by: §1.
  • [15] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 4700–4708. Cited by: 4th item, §3.3.
  • [16] E. Isufi, A. Loukas, and G. Leus (2017) Autoregressive moving average graph filters: a stable distributed implementation. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4119–4123. Cited by: §3.1, §3.4.
  • [17] E. Isufi, A. Loukas, A. Simonetto, and G. Leus (2017) Autoregressive moving average graph filtering. IEEE Transactions on Signal Processing 65 (2), pp. 274–288. Cited by: 3rd item, §2, §3.1, §3.4, §5.
  • [18] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR), Cited by: §4.1.
  • [19] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), Cited by: §1, §2, §2, §2, §4.1, §4.2, Table 4, Appendices.
  • [20] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (NeurIPS), pp. 1097–1105. Cited by: §1.
  • [21] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein (2017) Cayleynets: graph convolutional neural networks with complex rational spectral filters. IEEE Transactions on Signal Processing 67 (1), pp. 97–109. Cited by: §1, §1, §2, §2, §2, §2, §3.1, §3.4, Table 1, §4.1, §4.2, Table 4, Table 5.
  • [22] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei (2017) Fully convolutional instance-aware semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2359–2367. Cited by: §1.
  • [23] R. Liao, Z. Zhao, R. Urtasun, and R. S. Zemel (2019) LanczosNet: multi-scale deep graph convolutional networks. In Proceedings of the seventh International Conference on Learning Representation (ICLR), Cited by: §2, §2, §2, Table 1, §4.1, §4.2, Table 4, Appendices, Appendices.
  • [24] Q. Lu and L. Getoor (2003) Link-based classification. In Proceedings of the 20th International Conference on Machine Learning (ICML), pp. 496–503. Cited by: §4.1, Table 4.
  • [25] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: §4.5.
  • [26] B. Perozzi, R. Al-Rfou, and S. Skiena (2014) Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (SIGKDD), pp. 701–710. Cited by: §1, §4.1, Table 4.
  • [27] A. Sandryhaila and J. M. Moura (2013) Discrete signal processing on graphs: graph fourier transform. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6167–6170. Cited by: §2.
  • [28] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad (2008) Collective classification in network data. AI magazine 29 (3), pp. 93–93. Cited by: §4.1.
  • [29] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson (2014) CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 806–813. Cited by: §1.
  • [30] D. Shuman, S. Narang, P. Frossard, A. Ortega, and P. Vandergheynst (2013) The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine 30, pp. 83–98. Cited by: §2.
  • [31] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. International Conference on Learning Representations (ICLR). Cited by: §1, §4.1, §4.2, Table 4, Appendices.
  • [32] J. Weston, F. Ratle, H. Mobahi, and R. Collobert (2012) Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pp. 639–655. Cited by: §4.1, Table 4.
  • [33] Z. Yang, W. W. Cohen, and R. Salakhutdinov (2016) Revisiting semi-supervised learning with graph embeddings. In Proceedings of The 33rd International Conference on Machine Learning (ICML), pp. 40–48. Cited by: §4.1, §4.1, §4.1, §4.2, Table 4.
  • [34] X. Zhu, Z. Ghahramani, and J. D. Lafferty (2003) Semi-supervised learning using gaussian fields and harmonic functions. In Proceedings of the 20th International conference on Machine learning (ICML), pp. 912–919. Cited by: §4.1, Table 4.

Appendices

In the following, we provide further experiments on comparing our work with the others.

Comparison with different spectral graph filters. We have conducted an ablation study of our proposed graph filters. Specifically, we compare our feedback-looped filters, i.e., the newly proposed spectral filters in this paper, against other spectral filters such as Chebyshev filters and Cayley filters. To conduct this ablation study, we remove the dense connections from our model DFNet. The experimental results are presented in Table 5. They show that feedback-looped filters improve localization upon Chebyshev filters by a margin of 1.4%, 1.7% and 7.3% on the datasets Cora, Citeseer and Pubmed, respectively. They also improve upon Cayley filters by a margin of 0.7% on the Cora dataset.

Model Cora Citeseer Pubmed
Chebyshev filters [7] 81.2 69.8 74.4
Cayley filters [21] 81.9 - -
Feedback-looped filters (ours) 82.6 ± 0.3 71.5 ± 0.4 81.7 ± 0.6
Table 5: Accuracy (%) averaged over 10 runs.

Comparison with LNet and AdaLNet using different data splittings. We have benchmarked the performance of our DFNet model against the models LNet and AdaLNet proposed in [23], as well as Chebyshev, GCN and GAT, over the three citation network datasets Cora, Citeseer and Pubmed. We use the same data splittings (e.g., 5.2%, 3%, 1%, and 0.5% on Cora) as used in [23]. Note that 5.2% is the standard data splitting that was also used in previous works [7, 19, 31]. All the experiments are repeated 10 times. For our model DFNet, we use the same hyperparameter settings as discussed in Section 4.1.

Training Split Chebyshev GCN GAT LNet AdaLNet DFNet
5.2% (standard) 78.0 ± 1.2 80.5 ± 0.8 82.6 ± 0.7 79.5 ± 1.8 80.4 ± 1.1 85.2 ± 0.5
3% 62.1 ± 6.7 74.0 ± 2.8 56.8 ± 7.9 76.3 ± 2.3 77.7 ± 2.4 80.5 ± 0.4
1% 44.2 ± 5.6 61.0 ± 7.2 48.6 ± 8.0 66.1 ± 8.2 67.5 ± 8.7 69.5 ± 2.3
0.5% 33.9 ± 5.0 52.9 ± 7.4 41.4 ± 6.9 58.1 ± 8.2 60.8 ± 9.0 61.3 ± 4.3
Table 6: Accuracy (%) averaged over 10 runs on the Cora dataset.
Training Split Chebyshev GCN GAT LNet AdaLNet DFNet
5.2% (standard) 70.1 ± 0.8 68.1 ± 1.3 72.2 ± 0.9 66.2 ± 1.9 68.7 ± 1.0 74.2 ± 0.3
1% 59.4 ± 5.4 58.3 ± 4.0 46.5 ± 9.3 61.3 ± 3.9 63.3 ± 1.8 67.4 ± 2.3
0.5% 45.3 ± 6.6 47.7 ± 4.4 38.2 ± 7.1 53.2 ± 4.0 53.8 ± 4.7 55.1 ± 3.2
0.3% 39.3 ± 4.9 39.2 ± 6.3 30.9 ± 6.9 44.4 ± 4.5 46.7 ± 5.6 48.3 ± 3.5
Table 7: Accuracy (%) averaged over 10 runs on the Citeseer dataset.
Training Split Chebyshev GCN GAT LNet AdaLNet DFNet
5.2% (standard) 69.8 ± 1.1 77.8 ± 0.7 76.7 ± 0.5 78.3 ± 0.3 78.1 ± 0.4 84.3 ± 0.4
0.1% 55.2 ± 6.8 73.0 ± 5.5 59.6 ± 9.5 73.4 ± 5.1 72.8 ± 4.6 75.2 ± 3.6
0.05% 48.2 ± 7.4 64.6 ± 7.5 50.4 ± 9.7 68.8 ± 5.6 66.0 ± 4.5 67.2 ± 7.3
0.03% 45.3 ± 4.5 57.9 ± 8.1 50.9 ± 8.8 60.4 ± 8.6 61.0 ± 8.7 59.3 ± 6.6
Table 8: Accuracy (%) averaged over 10 runs on the Pubmed dataset.

Tables 6-8 present the experimental results. Table 6 shows that DFNet performs significantly better than all the other models over the Cora dataset, including LNet and AdaLNet proposed in [23]. Similarly, Table 7 shows that DFNet performs significantly better than all the other models over the Citeseer dataset. For the Pubmed dataset, as shown in Table 8, DFNet performs better than all the other models on the standard and 0.1% splittings, but slightly worse than LNet on the 0.05% splitting and than LNet and AdaLNet on the 0.03% splitting. These results demonstrate the robustness of our model DFNet.
