Homogeneous Network Embedding for Massive Graphs via Reweighted Personalized PageRank

Homogeneous Network Embedding for Massive Graphs via Reweighted Personalized PageRank


Given an input graph and a node , homogeneous network embedding (HNE) maps the graph structure in the vicinity of to a compact, fixed-dimensional feature vector. This paper focuses on HNE for massive graphs, e.g., with billions of edges. On this scale, most existing approaches fail, as they incur either prohibitively high costs, or severely compromised result utility.

Our proposed solution, called Node-Reweighted PageRank (), is based on a classic idea of deriving embedding vectors from pairwise personalized PageRank (PPR) values. Our contributions are twofold: first, we design a simple and efficient baseline HNE method based on PPR that is capable of handling billion-edge graphs on commodity hardware; second and more importantly, we identify an inherent drawback of vanilla PPR, and address it in our main proposal . Specifically, PPR was designed for a very different purpose, i.e., ranking nodes in based on their relative importance from a source node’s perspective. In contrast, HNE aims to build node embeddings considering the whole graph. Consequently, node embeddings derived directly from PPR are of suboptimal utility.

The proposed approach overcomes the above deficiency through an effective and efficient node reweighting algorithm, which augments PPR values with node degree information, and iteratively adjusts embedding vectors accordingly. Overall, takes time and space to compute all node embeddings for a graph with edges and nodes. Our extensive experiments that compare against existing solutions over 7 real graphs demonstrate that achieves higher result utility than all the solutions for link prediction, graph reconstruction and node classification, while being up to orders of magnitude faster. In particular, on a billion-edge Twitter graph, terminates within 4 hours, using a single CPU core.




[Technical Report]



1 Introduction

Given a graph with nodes, a network embedding maps each node to a compact feature vector in (), such that the embedding vector captures the graph features surrounding . These embedding vectors are then used as inputs in downstream machine learning operations [AlibabaGE18, FaloutsosTSDMZ18, WuFCABW18]. A homogeneous network embedding (HNE) is a type of network embedding that reflects the topology of rather than labels associated with nodes or edges. HNE methods have been commonly applied to various graph mining tasks based on neighboring associated similarities, including node classification [rcos13], link prediction [bljl11], and graph reconstruction [HOPE16]. This paper focuses on HNE computation on massive graphs, e.g., social networks involving billions of connections. Clearly, an effective solution for such a setting must be highly scalable and efficient, while obtaining high result utility.

HNE is a well studied problem in the data mining literature, and there are a plethora of solutions. However, most existing solutions fail to compute effective embeddings for large-scale graphs. For example, as we review in Section 2, a common approach is to learn node embeddings from random walk simulations, e.g., in [deepwalk14, node2vec16]. However, the number of possible random walks grows exponentially with the length of the walk; thus, for longer walks on a large graph, it is infeasible for the training process to cover even a considerable portion of the random walk space. Another popular methodology is to construct node embeddings by factorizing a proximity matrix, e.g., in [AROPE18]. The effectiveness of such methods depends on the proximity measure between node pairs. As explained in Section 2, capturing multi-hop topological information generally requires a sophisticated proximity measure; on the other hand, the computation, storage and factorization of such a proximity matrix often incur prohibitively high costs on large graphs.

Figure 1: An example graph .
Figure 2: PPR for and in Fig. 2 ().
0.15 0.269 0.188 0.118 0.17 0.048 0.029 0.019 0.008
0.15 0.118 0.188 0.269 0.17 0.048 0.029 0.019 0.008
0.036 0.043 0.056 0.043 0.093 0.137 0.29 0.187 0.12
0.02 0.024 0.031 0.024 0.056 0.083 0.168 0.311 0.282

This paper revisits an attractive idea: constructing HNEs by taking advantage of personalized PageRank (PPR) [PPR03]. Specifically, given a pair of nodes , the PPR value of with respect to is the probability that a random walk from terminates at . PPR values can be viewed as a concise summary of an infinite number of random walk simulations, which, intuitively, should be helpful in building node embeddings. Realizing the full potential of PPR for scalable HNE computation, however, is challenging. One major hurdle is cost: materializing the PPR between each pair of nodes clearly takes space for nodes (e.g., in [yin2019scalable]), and evaluating even a single PPR value can involve numerous random walk simulations (e.g., in [APP17, verse18]).

Further, we point out that even without considering computational costs, it is still tricky to properly derive HNEs from PPR values. The main issue is that PPR was designed to serve a very different purpose, i.e., ranking nodes in based on their relative importance from a source node’s perspective. In other words, PPR is essentially a local measure. On the other hand, HNE aims to summarize nodes from the view of the whole graph. To illustrate this critical difference, consider the example in Fig. 2 with nodes -. Observe that between the node pair and , there are three different nodes connecting them, i.e., , and . In contrast, there is only one common neighbor between and . Intuitively, if we were to predict a new edge in the graph, it is more likely to be (, ) than (, ). For instance, in a social network, the more mutual friends two users have, the more likely they know each other [brzozowski2011should]. However, as shown in Table 2, in terms of PPR values, we have , which tends to predict over and contradicts with the above intuition. This shows that PPR by itself is not an ideal proximity measure, at least for the task of link prediction. This problem is evident in PPR-based HNE methods, e.g., in [verse18, APP17], and a similar issue limits the effectiveness of a recent proposal [yin2019scalable], as explained in Section 2.

This paper addresses both the scalability and result utility issues of applying PPR to HNE computation with a novel solution called Node-Reweighted PageRank (). Specifically, we first present a simple and effective baseline approach that overcomes the efficiency issue of computing node embeddings using PPR values. The main proposal then extends this baseline by addressing the above-mentioned deficiency of traditional PPR. Specifically, augments PPR values with additional reweighting steps, which calibrate the embedding of each node to its in- and out- degrees. Intuitively, when a node has many neighbors (e.g., in Fig. 2), its embedding vector should be weighted up by considering its degree information, such that the proximity preserved in the inner product between the embedding vectors of the node and the other nodes in graph is amplified to reflect the importance of the node from the perspective of the whole graph, and vice versa. In , node reweighting is performed using an effective and scalable algorithm that iteratively adjusts node embeddings, whose cost is small compared to PPR computations. Overall, takes time and space to construct length- embeddings of all nodes in a graph with nodes and edges. In the common case that is small and the graph is sparse, the above complexities reduce to time and space.

We have conducted extensive experiments on 7 popular real datasets, and compared against 18 existing HNE solutions on three tasks: link prediction, graph reconstruction and node classification. In all settings, achieves the best result utility. Meanwhile, with a few exceptions, is often orders of magnitude faster than its competitors. In particular, on a Twitter graph with 1.2 billion edges, terminates within 4 hours on a single CPU core.

2 Related Work

Network embedding is a hot topic in graph mining, for which there exists a large body of literature as surveyed in [CaiZC18, abs-1711-08752, abs-1801-05852]. Here we review the HNE methods that are most relevant to this work.

Learning HNEs from random walks. A classic methodology for HNE computation is to learn embeddings from random walk simulations. Earlier methods in this category include [deepwalk14], [LINE15], [node2vec16] and [walklets17]. The basic idea is to learn the embedding of a node by iteratively “pulling” the embeddings of positive context nodes (i.e., those that are on the random walks originating from ) towards that of , and “pushing” the embeddings of negative context nodes (i.e., the nodes that are not connected to ) away from . Subsequent proposals [HARP18, ribeiro2017struc2vec] construct a multilayer graph over the original graph , and then perform random walks on different layers to derive more effective embeddings. Instead of using a predefined sampling distribution, [gao2018self] adaptively sample negative context nodes in terms of their informativeness. [chen2019exploiting] learns the embeddings of different centrality-based random walks, and combines these embeddings into one by weighted aggregation. Recent techniques [APP17] and [verse18] improve the quality of embeddings by refining the procedures for learning from PPR-based random walk samples. However, neither of them addresses the deficiency of traditional PPR as described in Section 1.

The main problem of random-walk-based HNE learning in general is their immense computational costs (proportional to the number of random walks), which can be prohibitive for large graphs. The high running time could be reduced with massively-parallel hardware, e.g., in [lerer2019pytorch], and/or with GPU systems, e.g., in [zhu2019graphy]. Nevertheless, they still incur a high financial cost for consuming large amounts of computational resources.

Learning HNEs without random walks. HNEs can also be learned directly from the graph structure using a deep neural network, without performing random walks. Training such a deep neural network, however, also incurs very high computational overhead, especially for large graphs [verse18]. Notably, [SDNE16] and [DNGR16] employ multi-layer auto-encoders with a target proximity matrix to learn embeddings. [kipf2016variational] combines the graph convolutional network [kipf2016semi] and auto-encoder models to learn embeddings. [lai2017preserving] utilizes a Siamese neural network to preserve both pointwise mutual information and global PageRank of nodes. [yu2018learning] and [tu2018deep] learn embeddings by feeding node sequences to a long short-term memory model (LSTM). [zhu2018deep] learns a Gaussian distribution in the Wasserstein space with the deep variational model as the latent representation of each node. [abu2018watch] applies graph attention mechanism to a closed-form expectation of the limited random-walk co-occurrence matrices [deepwalk14] to learn the embeddings. [wang2018graph], [dai2018adversarial] and [dai2019adversarial] adopt the popular generative adversarial networks (GAN) to accurately model the node connectivity probability. As demonstrated in our experiments, none of these methods scale to large graphs.

Constructing HNEs through matrix factorization. Another popular methodology for HNE computation is through factorizing a proximity matrix , where is the number of nodes in the input graph , and each entry signifies the proximity between a pair of nodes . The main research question here is how to choose an appropriate that (i) captures the graph topology well and (ii) is efficient to compute and factorize on large graphs. Specifically, to satisfy (i), each entry should accurately reflect the proximity between nodes via indirect connections, which can be long and complex paths. Meanwhile, to satisfy (ii) above, the computation / factorization of should be done in memory. This means that should either be sparse, or be efficiently factorized without materialization. In addition, note that for a directed graph , the proximity is also directed, meaning that it is possible that . Thus, methods that require to be symmetric are limited to undirected graphs and cannot handle directed graphs.

Earlier factorization-based work, including [tang2011leveraging, ahmed2013distributed, GraRep15, MNMF], directly computes before factorizing it to obtain node embeddings. For instance, spectral embedding [tang2011leveraging] simply outputs the top eigenvectors of the Laplacian matrix of an undirected graph as node embeddings. This method has limited effectiveness [deepwalk14, node2vec16], as the Laplacian matrix only captures one-hop connectivity information for each node. To remedy this problem, one idea is to construct a higher-order proximity matrix to capture multi-hop connectivity for each node [GraRep15, MNMF, NEU17]. However, such a higher-order proximity matrix is usually no longer sparse; consequently, materializing becomes prohibitively expensive for large graphs due to the space complexity for nodes.

Recent work [HOPE16, AROPE18, RandNE18] constructs network embeddings without materializing , to avoid extreme space overhead. Many of these methods, however, reply on the assumption that is symmetric; consequently, they are limited to undirected graphs as discussed above. For instance, [AROPE18] first applies an eigen-decomposition on the adjacency matrix of an undirected graph , and then utilizes the decomposition results to derive each node’s embedding to preserve proximity information, without explicitly constructing . Similar approaches have been adopted in [HOPE16, RandNE18] as well. In particular, [RandNE18] uses a Gaussian random projection of directly as node embeddings without factorization, in order to achieve high efficiency, at the cost of lower result utility.

The authors of [NetMF18] prove that random-walk-based methods such as as , and essentially perform matrix factorizations. Thus, they propose , which factorizes a proximity matrix that approximates the closed form representation of ’s implicit proximity matrix. However, requires materializing a dense , which is infeasible for large graphs. [qiu2019netsmf] improves the efficiency of by sparsifying using the theory of spectral sparsification. However, is still rather costly as it requires simulating a large number of random walks to construct . [zhang2019prone] learns embeddings via matrix factorization with the enhancement of spectral propagation. However, is mainly designed for node classification, and its accuracy is less competitive for other tasks such as link prediction and graph reconstruction. [liu2019general] iteratively fine-tunes the proximity matrix to obtain enhanced result effectiveness, at the expense of high comptuational costs.

HNE via Approximate Personalized PageRank. Although the idea of using PPR as the proximity measure to be preserved in the embeddings is often mentioned in random-walk-based solutions [APP17, verse18], these methods largely fail to scale due to numerous random walk simulations for the costly training process. A recent work [yin2019scalable] obtains better scalability by computing and factorizing a PPR-based proximity matrix instead. Specifically, builds node embeddings by factorizing the transpose proximity matrix, defined as , where and represent the approximate PPR matrices of the original graph and its transpose graph (i.e., obtained by reversing the direction of each edge in ), respectively.

The space and time complexities of are and , respectively, where is the error threshold for PPR values. In the literature of approximate PPR processing (e.g., [FORA17, TopPPR18, wang2019efficient, wang2019parallelizing, shi2019realtime]), is commonly set to , which would lead to prohibitively high space (i.e., ) and time (i.e., ) costs in . Instead, in [yin2019scalable], the authors fix to a constant and only retain PPR values greater than , which compromises result utility. Even so, is still far more expensive than the proposed solution , as shown in our experiments.

Further, as explained in Section 1, traditional PPR is not an ideal proximity measure for the purpose of HNE due to the former’s relative nature; this problem propagates to which uses the PPR-based transpose proximity measure, i.e., for each node pair . For instance, in the example of Table 2, we have , indicating that also tends to predict over in link prediction, which is counter-intuitive as we have explained in Section 1.

Other HNE methods. There also exist several techniques that generate embeddings without random walks, neural networks or matrix factorization. In particular, [ma2018hierarchical] applies expectation maximization to learn embeddings that capture the neighborhood structure of each node, as well as the latent hierarchical taxonomy in the graph. [RaRE18] considers both the proximity and popularity of nodes, and derive embeddings by maximizing a posteriori estimation using stochastic gradient descent. [donnat2018] represents each node’s neighborhood via a low-dimensional embedding by leveraging heat wavelet diffusion patterns, so as to capture structural roles of nodes in networks. [wang2018feature] transplants the feature hashing technique for word embeddings to embed nodes in networks. A common problem with the above methods is that they do not aim to preserve proximity information between nodes; consequently, they are generally less effective for tasks such as link prediction and graph reconstruction, as demonstrated in our experiments in Section 5.

3 Scalable PPR Computation and Factorization

This section presents , a simple and effective baseline approach to HNE that obtains node embeddings through factorizing a conceptual approximate PPR proximity matrix. Unlike previous methods, scales to billion-edge graphs without seriously compromising result quality. forms the foundation of the our main proposal , presented in Section 4. In what follows, Section 3.1 overviews and formally defines the main concepts. Section 3.2 presents the main contribution in : a scalable approximate PPR factorization algorithm. Table 1 summarizes frequent notations used throughout the paper.

Notation Description
= A graph with node set and edge set
The numbers of nodes and edges in , respectively
The in-degree of node
The out-degree of node
The adjacency, out-degree and transition matrices of
The random walk decay factor
The dimensionality of the embedding vectors
, The forward and backward embeddings, respectively
, The forward and backward weights for ’s forward and backward embeddings, respectively
Table 1: Frequently used notations.

3.1 Overview

As mentioned in Section 1, given an input graph , the goal of HNE is to construct a size- embedding for each node , where is a user-specified per-node space budget. The input graph can be either directed or undirected. For simplicity, in the following we assume that is directed; for an undirected graph, we simply replace each undirected edge () with two directed ones with opposing direction, i.e., () and (). Note that the capability to handle directed graphs is an advantage of our methods, compared to existing solutions that are limited to undirected graphs, e.g., [AROPE18, RandNE18, qiu2019netsmf, ma2018hierarchical, zhang2019prone].

In a directed graph, each node plays two roles: as the incoming end and outgoing end of edges, respectively. These two roles can have very different semantics. For instance, in a social graph, a user can deliberately choose to follow users who he/she is interested in, and is followed by users who are interested in him/her. The follower relationships and followee relationships of the same user should have different representations. This motivates building two separate embedding vectors and for each node , referred to as the forward and backward embeddings of , respectively. In our solutions, we assign equal space budget (i.e., ) to and .

One advantage of is that, it uses the PPR proximity matrix to do the factorization, without actually matrializing the matrix. Specifically, the definition of PPR is based on random walks, as follows. Suppose that we start a random walk from a source node . At each step, we terminate the walk with probability , and continue the walk (i.e., moving on to a random out-neighbor of ) with probability . Then, for each node , we define its PPR with respect to source node as the probability that the walk originating from terminates at .

Formally, let be an matrix where for the -th node and -th node in , and be the probability transition matrix of , i.e., , where is an out-neighbor of and denotes the out-degree of . Then,


directly uses as the proximity matrix, i.e., . The goal is then to factorize into the forward and backward embeddings of nodes of the input graph , such that for each pair of nodes and , i.e.:


Remark. Note that directly computing (and subsequently factorizing it into the node embeddings and ) is infeasible for a large graph. In particular, is a dense matrix that requires space for nodes, and Eq. (1) involves summing up an infinite series. To alleviate this problem, we could apply an approximate PPR algorithm [HubPPR16, FORA17, TopPPR18] to compute the top- largest PPR values for each node in , which reduces the space overhead to . Unfortunately, even the state-of-the-art approximate top- PPR algorithm, i.e., , is insufficient for our purpose. Specifically, takes time to compute the approximate top- PPR values for each node, where is a parameter that quantifies the difference between the top- and non-top- PPR values [TopPPR18]. To approximate the entire , we would need to invoke for every node, which incurs time super-quadratic to . Empirically, Ref. [TopPPR18] reports that running on a billion-edge Twitter graph (used in our experiments as well) takes about 15 seconds CPU time, for . The same graph contains over 41 million nodes, meaning that running on each of them would cost over 19 years of CPU time, which is infeasible even for a powerful computing cluster. While it is theoretically possible to reduce computational costs by choosing a small and/or a large error threshold in , doing so would lead to numerous zeros in , which seriously degrades the result quality. We address this challenge in the next subsection with a simple and effective solution.

3.2 PPR Approximation

Observe that our goal is to obtain the node embeddings and , rather than the PPR matrix itself. Thus, the main idea of is to integrate the computation and factorization of in the same iterative algorithm. Specifically, according to Eq. (1), can be viewed as the weighted sum of proximity values of different orders, i.e., one-hop proxmity, two-hop proximity, etc. Therefore, instead of first computing and then factorizing this dense matrix into node embeddings, we can instead start by factorizing the sparse first-order proximity matrix (i.e., ) into the initial embeddings and , and then iteratively refine and , thereby incorporating higher-order information into them. This allows us to avoid the substantial space and computational overheads incurred for the construction and factorization of the dense matrix .

First, we consider a truncated version of as follows:


where is a relative large constant (e.g., ). In other words, we set

where denotes an identity matrix. The rationale is that when is sufficiently large, is small, in which case becomes negligible. In addition, only affects the PPR from each node to itself, which has no impact on our objective in Eq. (2) since we only concern the PPR values between different nodes.

To decompose , observe that

where is the adjacency matrix of , and is an diagonal matrix where the -th diagonal element is . Instead of applying exact singular value decomposition (SVD) that is very time consuming, we factorize using the algorithm in [musco2015randomized] for randomized SVD, obtaining two matrices and a diagonal matrix given inputs and , such that . In short, reduces to a low-dimensional space by the Gaussian random projection and then performs SVD on the low-dimensional matrix. Given a relative error threshold , guarantees a error bound for spectral norm low-rank approximation, which is much tighter than the theoretical accuracy bounds provided by previous truncated SVD algorithms [halko2011finding, sarlos2006improved, clarkson2017low].

Given the output from , we set

After that, we compute

and set . This results in

Note that . It can be verified that Particularly, the following theorem establishes the accuracy guarantees of .

Theorem 1

Given , the dimensionality , the random walk decay factor , the number of iterations and error threshold for as inputs to Algorithm 1, it returns embedding matrices and ( ) that satisfy, for every pair of nodes with ,

and for every node ,

where is the -th largest singular value of . {proof} See Appendix A for the proof.

Theorem 1 indicates that the PPR value between any pair of nodes preserved in the embedding vectors and has absolute error at most and average absolute error of . Observe that the accuracy of the preserved PPR is restricted by and , namely the accuracy of the low-rank approximation, i.e., .

Finally, we use and as the initial forward and backward embeddings, respectively, for each node . Algorithm 1 summarizes the pseudo-code for this construction of and . Next, we present a concrete example.

Input: , , , , , , .
Output: , .
1 ;
2 ;
3 for  to  do
4        ;
7 return , ;
Algorithm 1
Figure 3: Illustration of Example 1 for the algorithm.
Example 1

Given input graph in Fig. 2 and input parameters , , we run Algorithm 1 on . It first applies on the adjacency matrix , which produces and as shown in Fig. 3.

first sets . Then, in each of the following iterations, the algorithm updates to . After repeating this process for iterations, scales by the weight and returns us and as in Fig. 3:

The inner product between and approximates . For example, consider node pairs and :

which are close to and in Table 2 respectively.

Time Complexity. By the analysis in [musco2015randomized], applying on requires time, where is a constant that controls the tradeoff between the efficiency and accuracy of SVD. In addition, Lines 2, 4, and 5 in Algorithm 1 respectively run in time. Therefore, the overall time complexity of Algorithm 1 is

which equals when and are regarded as constants.

4 Proposed NRP Algorithm

The algorithm presented in the previous section directly uses PPR as the proximity measure. However, as explained in Section 1, PPR by itself is not suitable for our purpose since it is a local measure, in the sense that PPR values are relative with respect to the source node. Consequently, PPR values for different source nodes are essentially incomparable, which is the root cause of the counter-intuitive observation in the example of Fig. 2 and Table 2.

In the proposed algorithm , we address the deficiency of PPR through a technique that we call node reweighting. Specifically, for any two nodes and , we aim to find forward and backward embeddings such that:


where is the PPR value of with respect to node as source, and are weights assigned to and , respectively. In other words, we let preserve a scaled version of . The goal of is then to find approximate node weights so that Eq. (4) properly expresses the proximity between nodes and . In , the node weights are learned through an efficient optimization algorithm, described later in this section. The proposed node reweighting overcomes the deficiency of PPR, which is confirmed in our experiments.

In the following, Section 4.1 explains the choice of node weights in . Sections 4.2 and 4.3 elaborate on the computation of node weights. Section 4.4 summarizes the complete algorithm.

4.1 Choice of Node Weights

As discussed before, the problem of PPR as a proximity measure is that it is a relative measure with respect to the source node. In particular, the PPR value does not take into account the number of out-going and in-coming edges that each node has. To address this issue, assigns to each node a forward weight and a backward weight , and uses instead of to gauge the strength of connection from to , as in Eq. (4). To compensate for the lack of node degree data in PPR values, we choose the forward and backward weights such that


In other words, for any nodes , we aim to ensure that (i) the “total strength” of connections from to other nodes is roughly equal to the out-degree of , and (ii) the total strength of connection from other nodes to is roughly to equal the in-degree of . The rationale is that if has a large out-degree, then it is more likely to be connected to other nodes, and hence, the proximity from to other nodes should be scaled up accordingly. The case for a node with a large in-degree is similar. In Section 5, we empirically show that this scaling approach significantly improves the effectiveness of our embeddings for not just link prediction but also other important graph analysis task such as graph reconstruction.

4.2 Learning Node Weights

Given the output and of (Algorithm 1), we use as an approximation of for any two different nodes and . Then we formulate an objective function for tuning node weights according to Eq. (5):


To explain, recall that we use to quantify the strength of connection from to , and hence, for any fixed (resp. ), measures the total strength of connections from to other nodes (resp. from other nodes to ). Therefore, by minimizing , we aim to ensure that the total strength of connections starting from (resp. ending at) each node is close to ’s out-degree (resp. in-degree), subject to a regularization term . In addition, we require that for all nodes to avoid negative node weights.

We derive an approximate solution for Eq. (4.2) using coordinate descent [wright2015coordinate]: We start with an initial solution and for each node , and then iteratively update each weight based on the other weights. In particular, for any node , the formula for updating is derived by taking the partial derivative of the objective function in Eq. (4.2) with respect to :



We identify the value of that renders the above partial derivative zero, i.e., . If the identified is smaller than , then we set it to instead to avoid negativity. This leads to the following formula for updating backward weight :


The formula for updating is similar and included in Appendix B for brevity.

By Eq. (8), each update of requires computing , , , and . Towards this end, a straightforward approach is to compute these variables directly based on their definitions in Eq. (4.2). This, however, leads to tremendous overheads. In particular, computing , , and requires a linear scan of for each node , which requires time. Deriving requires computing for each node , which incurs overhead. Furthermore, computing requires calculating for all , which takes time. Therefore, each update of takes , which leads to a total overhead of for updating all once. Apparently, this overhead is prohibitive for large graphs. To address this deficiency, in Section 4.3, we present a solution that reduces the overhead to instead of .

4.3 Accelerating Weight Updates

We observe that the updates of different node weights share a large amount of common computation. For example, for any node , deriving always requires computing . Intuitively, if we are able to reuse the result of such common computation for different nodes, then the overheads of our coordinate descent algorithm could be significantly reduced. In what follows, we elaborate how we exploit this idea to accelerate the derivation of , and .

Computation of . By the definitions of in Eq. (4.2),


Eq. (4.3) indicates that the values of all nodes share a common , while and of each node have in common. Observe that both and are independent of any backward weight. Motivated by this, we propose to first compute and , which takes time. After that, we can easily derive , and for any node with precomputed and . In that case, each update of , , and takes only time, due to Eq. (4.3). This leads to (instead of ) total computation time of , , and for all nodes.

Computation of . Note that

which can be rewritten as:


Observe that is independent of any backward weight. Thus, it can be computed once and reused in the computation of for all nodes. Meanwhile, both and are dependent on all of the backward weights, and hence, cannot be directly reused if we are to update each backward weight in turn. However, we note that and