
# A Bag-of-Paths Framework for Network Data Analysis (ArXiv preprint manuscript submitted for publication)

Kevin Françoisse, Ilkka Kivimäki, Amin Mantrach, Fabrice Rossi, Marco Saerens

Université catholique de Louvain, Belgium; Université Libre de Bruxelles, Belgium; Yahoo! Research, Sunnyvale, California, USA; Université Paris 1 Panthéon-Sorbonne, France
###### Abstract

This work develops a generic framework, called the bag-of-paths (BoP), for link and network data analysis. The central idea is to assign a probability distribution on the set of all paths in a network. More precisely, a Gibbs-Boltzmann distribution is defined over a bag of paths in a network, that is, on a representation that considers all paths independently. We show that, under this distribution, the probability of drawing a path connecting two nodes can easily be computed in closed form by simple matrix inversion. This probability captures a notion of relatedness between nodes of the graph: two nodes are considered as highly related when they are connected by many, preferably low-cost, paths. As an application, two families of distances between nodes are derived from the BoP probabilities. Interestingly, the second distance family interpolates between the shortest path distance and the resistance distance. In addition, it extends the Bellman-Ford formula for computing the shortest path distance in order to integrate sub-optimal paths by simply replacing the minimum operator by the soft minimum operator. Experimental results on semi-supervised classification show that both of the new distance families are competitive with other state-of-the-art approaches. In addition to the distance measures studied in this paper, the bag-of-paths framework enables straightforward computation of many other relevant network measures.

###### keywords:
Network science, link analysis, distance and similarity on a graph, shortest path distance, resistance distance, semi-supervised classification.

## 1 Introduction

### 1.1 General introduction

Network and link analysis is a highly studied field, subject of much recent work in various areas of science: applied mathematics, computer science, social science, physics, chemistry, pattern recognition, applied statistics, data mining & machine learning, to name a few Barabasi-2015 (); chung06 (); Estrada-2012 (); Kolaczyk-2009 (); Lewis09 (); Newman-2010 (); Thelwall04 (); Wasserman-1994 (). Within this context, one key issue is the proper quantification of the structural relatedness between nodes of a network by taking both direct and indirect connections into account. This problem is faced in all disciplines involving networks, in various types of problems such as link prediction, community detection, node classification, and network visualization.

The main contribution of this paper is in presenting in detail the bag-of-paths (BoP) framework and defining relatedness as well as distance measures between nodes from this framework. The BoP builds on and extends previous work dedicated to the exploratory analysis of network data Kivimaki-2012 (); Kivimaki-2014 (); Mantrach-2009 (); Yen-08K (). The introduced distances are constructed to capture the global structure of the graph by using paths on the graph as a building block. In addition to relatedness/distance measures, various other quantities of interest can be derived within the probabilistic BoP framework in a principled way, such as betweenness measures quantifying to which extent a node is in between two sets of nodes Lebichot-2014 (), extensions of the modularity criterion for, e.g., community detection Devooght-2014 (), measures capturing the criticality of the nodes or robustness of the network, graph cuts based on BoP probabilities, and so on.

### 1.2 The bag-of-paths framework

More precisely, we assume given a weighted, directed, strongly connected graph or network $G$ in which a cost is associated to each edge. Within this context, we consider a bag containing all the possible (either absorbing or non-absorbing) paths111Also called walks in the literature. between pairs of nodes in $G$. In a first step, following Akamatsu-1996 (); Mantrach-2009 (); Saerens-2008 (); Yen-08K (), a probability distribution on this countable set of paths can be defined by minimizing the total expected cost between all pairs of nodes while fixing the total relative entropy spread in the graph. This results in a Gibbs-Boltzmann distribution on the set of paths, depending on a temperature parameter $T$, such that long (high-cost) paths have a low probability of being sampled from the bag, while short (low-cost) paths have a high probability of being sampled.

In this probabilistic framework, the BoP probabilities, $P(\mathrm{s}=i, \mathrm{e}=j)$, that a sampled path has node $i$ as its starting node and node $j$ as its ending node can easily be computed in closed form by a simple $n \times n$ matrix inversion, where $n$ is the number of nodes in the graph. These BoP probabilities play a crucial role in our framework in that they capture the relatedness between two nodes $i$ and $j$ – the BoP probability will be high when the two nodes are connected by many, preferably short, paths. In summary, the BoP framework has several interesting properties:

• It has a clear, intuitive, interpretation.

• The temperature parameter $T$ allows the user to monitor randomness by controlling the balance between exploitation and exploration.

• The introduction of independent costs results in a large degree of customization of the model, according to the problem requirements: some paths could be penalized because they visit undesirable nodes having adverse features.

• The framework is rich. Many useful quantities of interest can be defined according to the BoP probabilistic framework: distance measures, betweenness measures, etc. This is discussed in the conclusion.

• The quantities of interest are easy to compute.

It, however, also suffers from a drawback: the different quantities are computed by solving a system of linear equations, or by matrix inversion. More precisely, the distance between a particular node and all the other nodes can be computed by solving a system of $n$ linear equations, while all pairwise distances can be computed at once by inverting an $n \times n$ square matrix. This results in an $O(n^3)$ computational complexity. Even more importantly, the matrix of distances necessitates $O(n^2)$ storage, although this can be alleviated by using, e.g., incomplete matrix factorization techniques.

This means that the different quantities can only be computed reasonably on small to medium size graphs (containing a few tens of thousands of nodes). However, in specific applications like classification or extraction of top eigenvectors, we can avoid computing explicitly the matrix inversion (see PageRank and the power method Langville-2006 (), or large scale semi-supervised classification on graphs Mantrach-2011 ()). In addition, it is also possible to restrict the set of paths to “efficient paths”, that is, paths that do not backtrack (always getting further from the starting node), and efficiently compute the distances from the starting node by a recurrence formula, as proposed in transportation theory Dial71 ().

### 1.3 Deriving node distances from the BoP framework

The paper first introduces the BoP framework in detail. After that, the two families of distances between nodes are defined, and are coined the surprisal distance and the potential distance. Both distance measures satisfy the triangle inequality, and thus satisfy the axioms of a metric. Moreover, the potential distance has the interesting property of generalizing the shortest path and the commute cost distances by computing an intermediate distance, depending on the temperature parameter $T$. When $T$ is close to zero, the distance reduces to the standard shortest path distance (emphasizing exploitation) while for $T \to \infty$, it reduces to the commute cost distance (focusing on exploration). The commute cost distance is closely related to the resistance distance FoussKDE-2005 (); Klein-1993 (), as the two functions are proportional to each other (as well as to the commute time distance) Chandra-1989 (); Kivimaki-2012 ().

This is of primary interest as it has been shown that both the shortest path distance and the resistance distance suffer from some significant flaws. While relevant in many applications, the shortest path distance cannot always be considered as a good candidate distance in network data. Indeed, this measure only depends on the shortest paths and thus does not integrate the “degree of connectivity” between the two nodes. In many applications, for a constant shortest path distance, nodes connected by many indirect paths should be considered as “closer” than nodes connected by only a few paths. This is especially relevant when considering relatedness of nodes based on communication, movement, etc., in a network, where these processes do not always happen optimally, nor completely randomly.

While the shortest path distance fails to take the whole structure of the graph into account, it has also been shown that the resistance distance converges to a useless value, only depending on the degrees of the two nodes, when the size of the graph increases (the random walker is getting “lost in space” because the Markov chain mixes too fast, see vonLuxburg-2010 ()). Moreover, the resistance distance, which is proportional to the commute cost distance, assumes a completely random movement or communication in the network, which is also unrealistic.

In short, shortest paths do not integrate the amount of connectivity between the two nodes, whereas random walks quickly lose the notion of proximity to the initial node when the graph becomes larger vonLuxburg-2010 ().

There is therefore a need for introducing distances interpolating between the shortest path distance and the resistance distance, thus hopefully avoiding the drawbacks appearing at the ends of the spectrum. These quantities capture the notion of relative accessibility between nodes, a combination of both proximity in the network and amount of connectivity.

Furthermore, and interestingly, a simple local recurrence expression, extending the Bellman-Ford formula, for computing the potential distances from one node of interest to all the other nodes is also derived. It relies on the use of the so-called soft minimum operator Cook-2011 () instead of the usual minimum. Finally, our experiments show that these distance families provide competitive results in semi-supervised learning.
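As an illustration of the idea, the soft minimum operator can be sketched in a few lines. This is a generic numerical sketch, not code from the paper; the function name `soft_min` and the toy values are our own assumptions:

```python
import numpy as np

def soft_min(values, theta):
    """Soft minimum of a set of values:
    -(1/theta) * log(sum_k exp(-theta * x_k)).
    Converges to min(values) as theta -> infinity."""
    x = np.asarray(values, dtype=float)
    m = x.min()  # shift by the true minimum for numerical stability
    return m - np.log(np.exp(-theta * (x - m)).sum()) / theta

# The soft minimum lower-bounds the true minimum and converges to it
# as the inverse temperature theta grows:
print(soft_min([1.0, 2.0, 5.0], theta=0.5))   # below 1.0
print(soft_min([1.0, 2.0, 5.0], theta=50.0))  # approximately 1.0
```

For large $\theta$ the operator tends to the ordinary minimum, recovering the usual Bellman-Ford recurrence, while for small $\theta$ sub-optimal alternatives contribute to the result.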

### 1.4 Contributions and organization of the paper

Thus, in summary, this work has several contributions:

• It introduces a well-founded bag-of-paths framework capturing the global structure of the graph by using network paths as a building block.

• It is shown that the bag-of-hitting-paths probabilities can easily be computed in closed form. This fundamental quantity defines an intuitive relatedness measure between nodes.

• It defines two families of distances capturing the structural dissimilarity between the nodes in terms of relative accessibility. The distances between all pairs of nodes can be computed conveniently by inverting an $n \times n$ matrix.

• It is shown that one of these distance measures has some interesting properties; for instance it is graph-geodetic and it interpolates between the shortest path distance and the resistance distance (up to a scaling factor).

• The framework is extended to the case where non-uniform priors are defined on the nodes.

• We prove that this distance generalizes the Bellman-Ford formula for computing shortest path distances, by simply replacing the $\min$ operator by the $\operatorname{softmin}$ operator.

• The distances obtain promising empirical results in semi-supervised classification tasks when compared to other, kernel-based, methods.

Section 2 develops related work and introduces the necessary background and notation. Section 3 introduces the BoP framework, defines the BoP probabilities and shows how they can be computed in closed form. Section 4 extends the framework to hitting, or absorbing, paths. In Section 5, the two families of distances as well as their properties are derived. Section 6 generalizes the framework to non-uniform priors on the nodes. An experimental study of the BoP framework with application to semi-supervised classification is presented in Section 7. Concluding remarks and extensions are discussed in Section 8.

## 2 Related work, background, and notation

### 2.1 Related work

This work is related to similarity measures on graphs, for which some background is presented in this section. The presented BoP framework also has applications in semi-supervised classification, on which our experimental section, Section 7, will focus. A short survey related to this problem can be found in Subsection 7.1.

Similarity measures on a graph determine to what extent two nodes in a graph resemble each other, either based on the information contained in the node attributes or based on the graph structure. In this work, only measures based on the graph structure will be investigated. Structural similarity measures can be categorized into two groups: local and global Lu-2011 (). On the one hand, local similarity measures between nodes consider the direct links from a node to the other nodes as features and use these features in various ways to provide similarities. Examples include the cosine coefficient Dunham-2003 () and the standard correlation Wasserman-1994 (). On the other hand, global similarity measures consider the whole graph structure to compute similarities. Our short review of similarity measures is largely inspired by the surveys appearing in FoussKernelNN-2011 (); Mantrach-2009 (); Yen-2008 (); Yen-08K ().

Certainly the most popular and useful distance between nodes of a graph is the shortest path distance. However, as discussed in the introduction, it is not always relevant for quantifying the similarity of nodes in a network.

Alternatively, similarity measures can be based on random walk models on the graph, seen as a Markov chain. As an example, the commute time (CT) kernel has been introduced in FoussKDE-2005 (); Saerens04PCA () as the Moore-Penrose pseudoinverse, $\mathbf{L}^{+}$, of the Laplacian matrix. The CT kernel was inspired by the work of Klein & Randic Klein-1993 () and Chandra et al. Chandra-1989 (). More precisely, Klein & Randic Klein-1993 () suggested to use the effective resistance between two nodes as a meaningful distance measure, called the resistance distance. Chandra et al. Chandra-1989 () then showed that the resistance distance equals the commute time distance, up to a constant factor. The CT distance is defined as the average number of steps that a random walker, starting in a given node, will take before entering another node for the first time (this is called the average first-passage time Norris-1997 ()) and going back to the initial node.

It was then shown Saerens04PCA () that the elements of $\mathbf{L}^{+}$ are inner products of the node vectors in the Euclidean space where these node vectors are exactly separated by the square root of the CT distance. The square root of the CT distance is therein called the Euclidean CT distance. The relationships between the Laplacian matrix and the commute cost distance (the expected cost (and not number of steps as for the CT) of reaching a destination node from a starting node and going back to the starting node) were studied in FoussKDE-2005 (). Finally, an electrical interpretation of the elements of $\mathbf{L}^{+}$ can be found in Yen-2008 (). However, we saw in the introduction that these random-walk based distances suffer from some drawbacks (e.g., the so-called “lost in space” problem vonLuxburg-2010 ()).
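As a minimal numerical sketch of the quantities just described (using a hypothetical 4-node undirected graph; the variable names and the graph are our own assumptions), the CT kernel and the commute-time distances can be computed as follows:

```python
import numpy as np

# A small undirected graph given by its adjacency matrix (toy example).
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
L = np.diag(A.sum(axis=1)) - A   # combinatorial Laplacian L = D - A
K_ct = np.linalg.pinv(L)         # commute-time (CT) kernel: pseudoinverse of L

# Commute-time distances follow from the kernel through
# ct(i, j) = vol(G) * (l+_ii + l+_jj - 2 l+_ij), with vol(G) the sum of degrees.
vol = A.sum()
d = np.diag(K_ct)
CT = vol * (d[:, None] + d[None, :] - 2.0 * K_ct)
print(np.round(CT, 3))
```

The resulting matrix is symmetric with a zero diagonal, as expected of a distance matrix.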

Sarkar et al. Sarkar2007 () suggested a fast method for computing truncated commute time neighbors. At the same time, several authors defined an embedding that preserves the commute time distance with applications in various fields such as clustering Luh-2005 (), collaborative filtering FoussKDE-2005 (); Brand-05 (), dimensionality reduction of manifolds Ham2004 () and image segmentation Qiu2005 ().

Instead of taking the pseudoinverse of the Laplacian matrix, a simple regularization leads to a kernel called the regularized commute time kernel Ito-2005 (); Chebotarev-1997 (); Chebotarev-1998a (). Ito et al. Ito-2005 () further propose the modified regularized Laplacian kernel by introducing another parameter controlling the importance of nodes. This modified regularized Laplacian kernel is also closely related to a graph regularization framework introduced by Zhou & Scholkopf in Zhou-04 (), extended to directed graphs in Zhou-05 ().

The exponential diffusion kernel, introduced by Kondor & Lafferty Kondor-2002 () and the Neumann diffusion kernel, introduced in Scholkopf-2002 () are similar and based on power series of the adjacency matrix. A meaningful alternative to the exponential diffusion kernel, called the Laplacian exponential diffusion kernel (see Kondor-2002 (); Smola-03 ()) is a diffusion model that substitutes the adjacency matrix with the Laplacian matrix.

Random walk with restart kernels, inspired by the PageRank algorithm and adapted to provide relative similarities between nodes, appeared relatively recently in Gori-2006WebKDD (); Pan-2004 (); Tong-2007 (). Nadler et al. Nadler-2005 (); Nadler-2006 () and Pons et al. Pons-2005 (); Pons-2006 () suggested a distance measure between nodes of a graph based on a diffusion process, called the diffusion distance. The Markov diffusion kernel has been derived from this distance measure in FoussKernelNN-2011 () and Yen-2011 (). The natural embedding induced by the diffusion distance was called diffusion map by Nadler et al. Nadler-2005 (); Nadler-2006 () and is related to correspondence analysis Yen-2011 ().

More recently, Mantrach et al. Mantrach-2009 (), inspired by Akamatsu-1996 (); Bell-1995 () and subsequently by Saerens-2008 (), introduced a link-based covariance measure between nodes of a weighted directed graph, called the sum-over-paths (SoP) covariance. They consider, in a similar manner as in this paper, a Gibbs-Boltzmann distribution on the set of paths such that high-cost paths occur with low probability whereas low-cost paths occur with a high probability. Two nodes are then considered as highly similar if they often co-occur together on the same – preferably short – path. A related co-betweenness measure between nodes has been defined in Kolaczyk-2009c ().

Moreover, as both the shortest path distance and the resistance distance show some issues, there were several attempts to define families of distances interpolating between the shortest path and more “global” distances, such as the resistance distance. In this context, inspired by Akamatsu-1996 (); Bell-1995 (); Saerens-2008 (), a parametrized family of dissimilarity measures, called the randomized shortest path (RSP) dissimilarity, reducing to the shortest path distance at one end of the parameter range, and to the resistance distance (up to a constant scaling factor) at the other end, was proposed in Yen-08K () and extended in Kivimaki-2012 (). Similar ideas appeared at the same time in Chebotarev-2011 (); Chebotarev-2012 (), based on considering the co-occurrences of nodes in forests of a graph, and in Herbster-2009 (); vonLuxburg-2011 (), based on a generalization of the effective resistance in electric circuits. These two last families are metrics while the RSP dissimilarity does not satisfy the triangle inequality. The potential and the surprisal distances introduced in this work belong to the same catalogue of distance families. See also Kivimaki-2012 (); Guex-2015 (); Guex-2016 () for other, closely related, formulations of families of distances based on free energy and network flows.

### 2.2 Background and notation

We now introduce the necessary notation for the bag-of-paths (BoP) framework, providing both a relatedness index and a distance measure between nodes of a network. First, note that, in the sequel, column vectors are written in bold lowercase while matrices are in bold uppercase.

Consider a weighted directed graph or network, $G = (V, E)$, assumed strongly connected, with a set $V$ of $n$ nodes (or vertices) and a set $E$ of edges (or arcs, links). An edge between node $i$ and node $j$ is denoted by $(i, j)$ or $i \to j$. Furthermore, it is assumed that we are given an adjacency matrix $\mathbf{A}$ with elements $a_{ij} \geq 0$ quantifying in some way the affinity between node $i$ and node $j$. When $a_{ij} > 0$, node $i$ and node $j$ are said to be adjacent, that is, connected by an edge. Conversely, $a_{ij} = 0$ means that $i$ and $j$ are not connected. We further assume that there are no self-loops, that is, $a_{ii} = 0$ for all $i$. From this adjacency matrix, a standard random walk on the graph is defined in the usual way. The transition probabilities associated to each node are simply proportional to the affinities and then normalized:

$$p^{\mathrm{ref}}_{ij} = \frac{a_{ij}}{\sum_{j'=1}^{n} a_{ij'}} \qquad (1)$$

Note that these transition probabilities will be used as reference probabilities later; hence the superscript “ref”. The matrix $\mathbf{P}^{\mathrm{ref}}$, containing the elements $p^{\mathrm{ref}}_{ij}$, is stochastic and called the transition matrix of the natural or reference random walk on the graph.

In addition, we assume that a transition cost, $c_{ij}$, is associated to each link $(i, j)$ of the graph $G$. If there is no edge between $i$ and $j$, the cost is assumed to take an infinite value, $c_{ij} = \infty$. For consistency, $c_{ij} = \infty$ if and only if $a_{ij} = 0$. The cost matrix $\mathbf{C}$ is the matrix containing the immediate costs $c_{ij}$ as elements. We will assume that at least one element of $\mathbf{C}$ is strictly positive. A path $\wp$ is a finite sequence of jumps to adjacent nodes on $G$ (including loops), initiated from a starting node $i$, and stopping in an ending node $j$. The total cost of a path, $\tilde{c}(\wp)$, is simply the sum of the local costs $c_{ij}$ along $\wp$, while the length of a path is the number of steps, or jumps, needed for following that path.

The costs are set independently of the adjacency matrix; they quantify the cost of a transition, depending on the problem at hand. They can, e.g., be defined according to some properties, or features, of the nodes or the edges in order to bias the probability distribution of choosing a path. In the case of a social network, we may, for instance, want to bias the paths in favor of domain experts. In that case, the cost of jumping to a node could be set inversely proportional to the degree of expertise of the corresponding person. Therefore, walks visiting a large proportion of persons with a low degree of expertise would be penalized versus walks visiting persons with a high degree. Another example aims to favor hub-avoiding paths by penalizing paths visiting hubs. Then, the cost $c_{ij}$ can be simply set to the degree of node $j$. If there is no reason to bias the paths with respect to some features, costs are simply set equal to $1$ (paths are penalized by their length) or equal to $c_{ij} = 1/a_{ij}$ (the elements of the adjacency matrix can then be considered as conductances and the costs as resistances).
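The conventions above can be sketched numerically. The 3-node affinity matrix below is a toy example of our own, not taken from the paper:

```python
import numpy as np

# Toy weighted adjacency matrix of a strongly connected graph
# (a_ij = 0 means "no edge from i to j").
A = np.array([[0., 2., 1.],
              [1., 0., 1.],
              [2., 1., 0.]])

# Reference random-walk transition matrix (Equation (1)):
# each row of A is normalized to sum to one.
P_ref = A / A.sum(axis=1, keepdims=True)

# Two default cost conventions discussed above:
C_unit = np.where(A > 0, 1.0, np.inf)  # unit costs: paths penalized by length
C_res = np.where(A > 0, 1.0 / np.where(A > 0, A, 1.0), np.inf)  # costs as resistances 1/a_ij

print(P_ref.sum(axis=1))  # each row sums to 1
```

Non-edges receive an infinite cost, consistent with the convention $c_{ij} = \infty$ if and only if $a_{ij} = 0$.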

## 3 The basic bag-of-paths framework

Roughly speaking, the BoP model will be based on the probability that a path drawn from a “bag of paths” has nodes $i$ and $j$ as its starting and ending nodes, respectively. According to this model, the probability of drawing a path starting in node $i$ and ending in node $j$ from the bag of paths can easily be computed in closed form. This probability distribution then serves as a building block for several extensions.

The bag-of-paths framework is introduced by first considering bounded paths and then paths of arbitrary length. For simplicity, we discuss non-hitting (or non-absorbing) paths first and then develop the more interesting bag-of-hitting-paths framework in the next section.

### 3.1 Sampling bounded paths according to a Gibbs-Boltzmann distribution

The present section describes how the probability distribution on the set of paths is assigned. In order to make the presentation rigorous, we will first have to consider paths of bounded length $t$. Later, we will extend the results to paths of arbitrary length. Let us first choose two nodes, a starting node $i$ and an ending node $j$, and define the set of paths (including cycles) of length $t$ from $i$ to $j$ as $\mathcal{P}^{(t)}_{ij}$. Thus, $\mathcal{P}^{(t)}_{ij}$ contains all the paths allowing to reach node $j$ from node $i$ in exactly $t$ steps.

Let us further denote as $\tilde{c}(\wp)$ the total cost associated to path $\wp$. Here, we assume that $\wp$ is a valid path from node $i$ to node $j$, that is, it consists of a sequence of nodes $(i = k_{0}, k_{1}, \dots, k_{t} = j)$ where $a_{k_{\tau-1} k_{\tau}} > 0$ for all $\tau \in \{1, \dots, t\}$. As already mentioned, we assume that the total cost associated to a path is additive, i.e. $\tilde{c}(\wp) = \sum_{\tau=1}^{t} c_{k_{\tau-1} k_{\tau}}$. Then, let us define the set of all $t$-length paths through the graph between all pairs of nodes as $\mathcal{P}^{(t)} = \bigcup_{i,j=1}^{n} \mathcal{P}^{(t)}_{ij}$.

Finally, the set of all bounded paths up to length $t$ is denoted by $\mathcal{P}^{(\leq t)} = \bigcup_{\tau=0}^{t} \mathcal{P}^{(\tau)}$. Note that, by convention, zero-length paths (the case $\tau = 0$ with $i = j$) are allowed with zero associated cost. Other types of paths will be introduced later; a summary of the mathematical notation appears in Table 1.

Now, we consider a probability distribution on this finite set $\mathcal{P}^{(\leq t)}$, representing the probability of drawing a path from a bag containing all paths up to length $t$. We search for the distribution of paths $P(\wp)$ minimizing the expected total cost-to-go, $\sum_{\wp} P(\wp)\,\tilde{c}(\wp)$, among all the distributions having a fixed relative entropy $J_{0}$ with respect to a reference distribution, here induced by the natural random walk on the graph (see Equation (1)). This choice naturally defines a probability distribution on the set of paths of maximal length $t$ such that high-cost paths occur with a low probability while short paths occur with a high probability. In other words, we are seeking path probabilities, $P(\wp)$, $\wp \in \mathcal{P}^{(\leq t)}$, minimizing the expected total cost subject to a constant relative entropy constraint222In theory, non-negativity constraints should be added, but this is not necessary as the resulting probabilities are automatically non-negative.:

$$\begin{array}{rl}
\displaystyle \operatorname*{minimize}_{\{P(\wp)\,:\,\wp \in \mathcal{P}^{(\leq t)}\}} & \displaystyle \sum_{\wp \in \mathcal{P}^{(\leq t)}} P(\wp)\,\tilde{c}(\wp)\\[2ex]
\text{subject to} & \displaystyle \sum_{\wp \in \mathcal{P}^{(\leq t)}} P(\wp) \log\big(P(\wp)/\tilde{P}^{\mathrm{ref}}(\wp)\big) = J_{0}\\[2ex]
& \displaystyle \sum_{\wp \in \mathcal{P}^{(\leq t)}} P(\wp) = 1
\end{array} \qquad (2)$$

where $J_{0}$ is provided a priori by the user, according to the desired degree of randomness, and $\tilde{P}^{\mathrm{ref}}(\wp)$ represents the probability of following the path $\wp$ when walking according to the reference transition probabilities of the natural random walk on $G$ (see Equation (1)).

More precisely, we define $\tilde{\pi}^{\mathrm{ref}}(\wp) = \prod_{\tau=1}^{t} p^{\mathrm{ref}}_{k_{\tau-1} k_{\tau}}$, that is, the product of the transition probabilities along path $\wp$ – the likelihood of the path when the starting and ending nodes are known. Now, if we assume a uniform (non-uniform priors are considered in Section 6), independent, a priori probability, $1/n$, for choosing both the starting and the ending node, then we set $\tilde{P}^{\mathrm{ref}}(\wp) \propto \tilde{\pi}^{\mathrm{ref}}(\wp)$, with the proportionality constant chosen so that the reference probabilities sum to one333We will see later that the path likelihoods are already properly normalized in the case of hitting, or absorbing, paths. See A..

The problem (2) can be solved by introducing the following Lagrange function

$$\mathscr{L} = \sum_{\wp \in \mathcal{P}^{(\leq t)}} P(\wp)\,\tilde{c}(\wp) + \lambda \Bigg[ \sum_{\wp \in \mathcal{P}^{(\leq t)}} P(\wp) \log \frac{P(\wp)}{\tilde{P}^{\mathrm{ref}}(\wp)} - J_{0} \Bigg] + \mu \Bigg[ \sum_{\wp \in \mathcal{P}^{(\leq t)}} P(\wp) - 1 \Bigg] \qquad (3)$$

and optimizing over the set of path probabilities $\{P(\wp)\}$. As could be expected, setting the partial derivative of $\mathscr{L}$ with respect to $P(\wp)$ to zero and solving the equation yields a Gibbs-Boltzmann probability distribution on the set of paths up to length $t$ Mantrach-2009 (),

$$P(\wp) = \frac{\tilde{P}^{\mathrm{ref}}(\wp) \exp[-\theta \tilde{c}(\wp)]}{\displaystyle\sum_{\wp' \in \mathcal{P}^{(\leq t)}} \tilde{P}^{\mathrm{ref}}(\wp') \exp[-\theta \tilde{c}(\wp')]} \qquad (4)$$

where the Lagrange parameter $\lambda$ plays the role of a temperature $T$ and $\theta = 1/\lambda$ is the inverse temperature.

Thus, as desired, short paths (having a low cost $\tilde{c}(\wp)$) are favored in that they have a large probability of being followed. From Equation (4), we clearly observe that when $\theta \to 0^{+}$, the path probabilities reduce to the probabilities generated by the natural random walk on the graph (characterized by the transition probabilities defined in Equation (1)). In this case, the relative entropy $J_{0}$ vanishes as well. But when $\theta$ is large, the probability distribution defined by Equation (4) is biased towards low-cost paths (the most likely paths are the shortest ones). Note that, in the sequel, it will be assumed that the user provides the value of the parameter $\theta$ instead of $J_{0}$, with $\theta > 0$. Also notice that the model could be derived thanks to a maximum entropy principle instead Jaynes-1957 (); Kapur-1992 ().
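On a small graph, this distribution can be checked by brute-force enumeration of the bounded paths. The sketch below is illustrative only: the 3-node graph, unit costs, and variable names are our own assumptions, and the uniform prior on endpoints cancels after normalization:

```python
import itertools
import numpy as np

# Toy graph: reference transition probabilities; unit costs on every edge,
# so the cost of a path equals its length.
P_ref = np.array([[0. , 0.5, 0.5],
                  [0.5, 0. , 0.5],
                  [0.5, 0.5, 0. ]])
n, t_max, theta = 3, 3, 2.0

# Enumerate every path of length 0..t_max and give it the unnormalized
# Gibbs weight pi_ref(path) * exp(-theta * cost(path)), as in Equation (4).
paths, weights = [], []
for t in range(t_max + 1):
    for nodes in itertools.product(range(n), repeat=t + 1):
        lik = np.prod([P_ref[a, b] for a, b in zip(nodes, nodes[1:])])
        if lik > 0:  # keep only valid paths (every step follows an edge)
            paths.append(nodes)
            weights.append(lik * np.exp(-theta * t))
P = np.array(weights) / np.sum(weights)  # Gibbs-Boltzmann distribution

print(len(paths), P.sum())
```

As expected, the probabilities sum to one and a direct one-step path receives far more mass than a three-step detour between the same pair of nodes.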

### 3.2 The bag-of-paths probabilities

Our BoP framework will be based on the computation of another important quantity derived from Equation (4): the probability of drawing a path starting in some node $i$ and ending in some other node $j$ from the bag of paths. For paths up to length $t$, this is provided by

$$P^{(\leq t)}(\mathrm{s}=i, \mathrm{e}=j) = \frac{\displaystyle\sum_{\wp \in \mathcal{P}^{(\leq t)}_{ij}} \tilde{P}^{\mathrm{ref}}(\wp) \exp[-\theta \tilde{c}(\wp)]}{\displaystyle\sum_{\wp' \in \mathcal{P}^{(\leq t)}} \tilde{P}^{\mathrm{ref}}(\wp') \exp[-\theta \tilde{c}(\wp')]} = \frac{\displaystyle\sum_{\wp \in \mathcal{P}^{(\leq t)}_{ij}} \tilde{\pi}^{\mathrm{ref}}(\wp) \exp[-\theta \tilde{c}(\wp)]}{\displaystyle\sum_{\wp' \in \mathcal{P}^{(\leq t)}} \tilde{\pi}^{\mathrm{ref}}(\wp') \exp[-\theta \tilde{c}(\wp')]} \qquad (5)$$

where $\mathcal{P}^{(\leq t)}_{ij}$ is the set of paths connecting node $i$ and node $j$ up to length $t$. From (4), this quantity simply computes the probability mass of drawing a path connecting $i$ to $j$. The paths in $\mathcal{P}^{(\leq t)}_{ij}$ can contain loops and could visit nodes $i$ and $j$ several times during the trajectory444Note that another interesting class of paths, the hitting, or absorbing, paths – allowing only one single visit to the ending node – will be considered in the next section 4..

#### 3.2.1 Computation of the bag-of-paths probabilities for bounded paths

The analytical expression for computing the quantity defined by Equation (5) will be derived in this subsection. Then, in the following subsection, its definition will be extended to the set of paths of arbitrary length (unbounded paths) by taking the limit $t \to \infty$.

We start from the cost matrix, $\mathbf{C}$, from which we build a new matrix, $\mathbf{W}$, as

$$\mathbf{W} = \mathbf{P}^{\mathrm{ref}} \circ \exp[-\theta \mathbf{C}] = \exp[-\theta \mathbf{C} + \log \mathbf{P}^{\mathrm{ref}}] \qquad (6)$$

where $\mathbf{P}^{\mathrm{ref}}$ is the transition probability matrix555Do not confuse the matrix $\mathbf{P}^{\mathrm{ref}}$ in bold with $\tilde{P}^{\mathrm{ref}}(\wp)$ representing the reference probability of path $\wp$. A summary of the notation appears in Table 1. of the natural random walk on the graph containing the elements $p^{\mathrm{ref}}_{ij}$, and the logarithm/exponential functions are taken elementwise. Moreover, $\circ$ is the elementwise (Hadamard) matrix product. Note that the matrix $\mathbf{W}$ is not symmetric in general.

Then, let us first compute the numerator of Equation (5). Because all the quantities in the exponential of Equation (5) are summed along a path, $\tilde{c}(\wp) = \sum_{\tau} c_{k_{\tau-1} k_{\tau}}$ and $\tilde{\pi}^{\mathrm{ref}}(\wp) = \prod_{\tau} p^{\mathrm{ref}}_{k_{\tau-1} k_{\tau}}$ where each link $(k_{\tau-1}, k_{\tau})$ lies on path $\wp$, we immediately observe that element $i,j$ of the matrix $\mathbf{W}^{\tau}$ ($\mathbf{W}$ to the power $\tau$) is $[\mathbf{W}^{\tau}]_{ij} = \sum_{\wp \in \mathcal{P}^{(\tau)}_{ij}} \tilde{\pi}^{\mathrm{ref}}(\wp) \exp[-\theta \tilde{c}(\wp)]$, where $\mathcal{P}^{(\tau)}_{ij}$ is the set of paths connecting the starting node $i$ to the ending node $j$ in exactly $\tau$ steps.

Consequently, the sum in the numerator of Equation (5) is

$$\sum_{\wp \in \mathcal{P}^{(\leq t)}_{ij}} \tilde{\pi}^{\mathrm{ref}}(\wp) \exp[-\theta \tilde{c}(\wp)] = \sum_{\tau=0}^{t} \sum_{\wp \in \mathcal{P}^{(\tau)}_{ij}} \tilde{\pi}^{\mathrm{ref}}(\wp) \exp[-\theta \tilde{c}(\wp)] = \sum_{\tau=0}^{t} [\mathbf{W}^{\tau}]_{ij} = \Bigg[\sum_{\tau=0}^{t} \mathbf{W}^{\tau}\Bigg]_{ij} = \mathbf{e}_{i}^{\mathsf{T}} \Bigg(\sum_{\tau=0}^{t} \mathbf{W}^{\tau}\Bigg) \mathbf{e}_{j} \qquad (7)$$

where $\mathbf{e}_{i}$ is a column vector full of 0’s, except in position $i$ where it contains a 1. By convention, at time step 0, the random walker appears in node $i$ with probability one and a zero cost: $[\mathbf{W}^{0}]_{ij} = \delta_{ij}$. This means that zero-length paths (without any transition step) are allowed in $\mathcal{P}^{(\leq t)}_{ij}$. If, on the contrary, we want to dismiss zero-length paths, we could redefine $\mathcal{P}^{(\leq t)}_{ij}$ as the set of paths of length at least one (the summation then starts at $\tau = 1$ instead of $\tau = 0$) and proceed in the same manner.

Equation (7) allows us to derive the analytical form of the probability of drawing a bounded path (up to length $t$) starting in node $i$ and ending in node $j$. Indeed, substituting Equation (7) into Equation (5), and recalling that $\tilde{P}^{\mathrm{ref}}(\wp) \propto \tilde{\pi}^{\mathrm{ref}}(\wp)$ (the uniform prior factors cancel out in the ratio), we obtain

$$P^{(\leq t)}(\mathrm{s}=i, \mathrm{e}=j) = \frac{\mathbf{e}_{i}^{\mathsf{T}} \Big(\sum_{\tau=0}^{t} \mathbf{W}^{\tau}\Big) \mathbf{e}_{j}}{\displaystyle\sum_{i,j=1}^{n} \mathbf{e}_{i}^{\mathsf{T}} \Big(\sum_{\tau=0}^{t} \mathbf{W}^{\tau}\Big) \mathbf{e}_{j}} = \frac{\mathbf{e}_{i}^{\mathsf{T}} \Big(\sum_{\tau=0}^{t} \mathbf{W}^{\tau}\Big) \mathbf{e}_{j}}{\mathbf{e}^{\mathsf{T}} \Big(\sum_{\tau=0}^{t} \mathbf{W}^{\tau}\Big) \mathbf{e}} \qquad (8)$$

where $\mathbf{e}$ is a column vector of 1’s. Of course, there is no a priori reason to choose a particular path length; we will therefore consider paths of arbitrary length in the next subsection.
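A small numerical sketch of this bounded computation follows, under our own illustrative assumptions (a 3-node graph, unit costs, and a truncation length of ten):

```python
import numpy as np

# Toy setting: reference transition matrix and unit costs on the edges.
P_ref = np.array([[0. , 0.5, 0.5],
                  [0.5, 0. , 0.5],
                  [0.5, 0.5, 0. ]])
C = np.ones((3, 3))  # unit costs (non-edge entries are irrelevant: P_ref is 0 there)
theta, t = 2.0, 10

# W = P_ref o exp(-theta * C), elementwise, as in Equation (6).
W = P_ref * np.exp(-theta * C)

# Accumulate S = I + W + W^2 + ... + W^t.
S = np.eye(3)
Wk = np.eye(3)
for _ in range(t):
    Wk = Wk @ W
    S += Wk

# Bounded bag-of-paths probabilities (Equation (8)): normalize by the total mass.
P_bop = S / S.sum()
print(P_bop.sum())
```

Note that the diagonal entries dominate their rows here, since zero-length paths (the identity term) carry a large share of the probability mass.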

#### 3.2.2 Proceeding with paths of arbitrary length

Let us now consider the problem of computing the probability of drawing a path starting in $i$ and ending in $j$ from a bag containing paths of arbitrary length, and therefore usually containing an infinite number of paths. Following the definition in the bounded case (Equation (5)), this quantity will be denoted as $P(\mathrm{s}=i, \mathrm{e}=j)$ and defined by

$$P(\mathrm{s}=i, \mathrm{e}=j) = \lim_{t \to \infty} P^{(\leq t)}(\mathrm{s}=i, \mathrm{e}=j) = \frac{\displaystyle\sum_{\wp \in \mathcal{P}_{ij}} \tilde{\pi}^{\mathrm{ref}}(\wp) \exp[-\theta \tilde{c}(\wp)]}{\displaystyle\sum_{\wp' \in \mathcal{P}} \tilde{\pi}^{\mathrm{ref}}(\wp') \exp[-\theta \tilde{c}(\wp')]} \qquad (9)$$

where $\mathcal{P}_{ij}$ is the set of paths (of all lengths) connecting $i$ to $j$ in the graph and the denominator is called the partition function of the bag-of-paths system,

$$\mathcal{Z} = \sum_{\wp \in \mathcal{P}} \tilde{\pi}^{\mathrm{ref}}(\wp) \exp[-\theta \tilde{c}(\wp)] \qquad (10)$$

The quantity $P(\mathrm{s}=i, \mathrm{e}=j)$ in Equation (9) will be called the bag-of-paths probability of drawing a path of arbitrary length starting from node $i$ and ending in node $j$. As already stated, this key quantity captures a notion of relatedness, or similarity, between nodes of $G$. From Equation (9), we observe that two nodes are considered as highly related (high probability of sampling them) when they are connected by many, preferably low-cost, paths, that is, when they are highly accessible. The quantity therefore integrates the concept of (indirect) connectivity, in addition to proximity (low-cost paths).

Now, from Equation (8), we need to compute

 $P(s=i,e=j) = \lim_{t\to\infty} P^{(\leq t)}(s=i,e=j) = \lim_{t\to\infty} \dfrac{\mathbf{e}_i^{\mathsf{T}}\left(\sum_{\tau=0}^{t}\mathbf{W}^{\tau}\right)\mathbf{e}_j}{\mathbf{e}^{\mathsf{T}}\left(\sum_{\tau=0}^{t}\mathbf{W}^{\tau}\right)\mathbf{e}}$ (11)

We thus need to compute the well-known power series of $\mathbf{W}$,

 $\lim_{t\to\infty}\sum_{\tau=0}^{t}\mathbf{W}^{\tau} = \sum_{t=0}^{\infty}\mathbf{W}^{t} = (\mathbf{I}-\mathbf{W})^{-1}$ (12)

which converges if the spectral radius of $\mathbf{W}$ is less than 1, $\rho(\mathbf{W}) < 1$. Because the matrix $\mathbf{W}$ only contains non-negative elements and $G$ is strongly connected, a sufficient condition for $\rho(\mathbf{W}) < 1$ is that $\mathbf{W}$ is substochastic Meyer-2000 (), which is always achieved for $\theta > 0$ as $w_{ij} = p^{\mathrm{ref}}_{ij}\exp[-\theta c_{ij}] \leq p^{\mathrm{ref}}_{ij}$ for all $i, j$ and we assume that at least one element of $\mathbf{C}$ is strictly positive. We therefore assume $\theta > 0$.

Now, if we pose

 $\mathbf{Z} = (\mathbf{I}-\mathbf{W})^{-1}$ (13)

with $\mathbf{W}$ given by Equation (6), we can pursue the computation of the numerator of Equation (11),

 $\mathbf{e}_i^{\mathsf{T}}\left(\sum_{t=0}^{\infty}\mathbf{W}^{t}\right)\mathbf{e}_j = \mathbf{e}_i^{\mathsf{T}}(\mathbf{I}-\mathbf{W})^{-1}\mathbf{e}_j = \mathbf{e}_i^{\mathsf{T}}\mathbf{Z}\mathbf{e}_j = [\mathbf{Z}]_{ij} = z_{ij}$ (14)

where $z_{ij}$ is element $i, j$ of $\mathbf{Z}$. By analogy with Markov chain theory, $\mathbf{Z}$ is called the fundamental matrix Kemeny-1960 (). Elementwise, following Equations (7)-(14), we have that

 $z_{ij} = \sum_{\wp\in\mathcal{P}_{ij}} \tilde{\pi}_{\mathrm{ref}}(\wp)\exp[-\theta\tilde{c}(\wp)]$ (15)

which is actually related to the potential of a Markov chain Cinlar-1975 (); Norris-1997 (). From the previous equation, $z_{ij}$ can be interpreted as

 $z_{ij} = \sum_{t=0}^{\infty}[\mathbf{W}^{t}]_{ij} = \delta_{ij} + p^{\mathrm{ref}}_{ij}e^{-\theta c_{ij}} + \sum_{k_1=1}^{n} p^{\mathrm{ref}}_{ik_1}p^{\mathrm{ref}}_{k_1 j}e^{-\theta(c_{ik_1}+c_{k_1 j})} + \sum_{k_1=1}^{n}\sum_{k_2=1}^{n} p^{\mathrm{ref}}_{ik_1}p^{\mathrm{ref}}_{k_1 k_2}p^{\mathrm{ref}}_{k_2 j}e^{-\theta c_{ik_1}}e^{-\theta c_{k_1 k_2}}e^{-\theta c_{k_2 j}} + \cdots$ (16)
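As a quick numerical sanity check (not taken from the paper), the series expansion above can be verified on a toy graph; the 3-node path graph, unit edge costs, and $\theta = 1$ below are arbitrary choices for illustration only:

```python
import numpy as np

# Illustrative setup: a 3-node path graph 1-2-3 with unit edge costs
# and inverse temperature theta = 1 (all arbitrary choices).
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
C = A.copy()                                   # unit cost on each edge
theta = 1.0

P_ref = A / A.sum(axis=1, keepdims=True)       # natural random-walk reference
W = P_ref * np.exp(-theta * C)                 # Equation (6), elementwise

# Fundamental matrix Z = (I - W)^{-1} (Equation (13)) versus the
# truncated power series sum_{t=0}^{T} W^t (Equation (16)).
Z = np.linalg.inv(np.eye(3) - W)
Z_series = sum(np.linalg.matrix_power(W, t) for t in range(60))

print(np.max(np.abs(Z - Z_series)))            # truncation error is tiny
```

Since the spectral radius of $\mathbf{W}$ is well below 1 here, the truncated series converges very quickly to the matrix inverse.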

For the denominator of Equations (9) and (11), we immediately find

 $\mathcal{Z} = \mathbf{e}^{\mathsf{T}}\mathbf{Z}\mathbf{e} = z_{\bullet\bullet}$ (17)

where $z_{\bullet\bullet}$, the sum of all elements of $\mathbf{Z}$, is the value of the partition function $\mathcal{Z}$. Therefore, from Equation (11), the probability of drawing a path starting in node $i$ and ending in node $j$ in our bag-of-paths model is simply

 $P(s=i,e=j) = \dfrac{z_{ij}}{\mathcal{Z}}, \text{ with } \mathbf{Z} = (\mathbf{I}-\mathbf{W})^{-1} \text{ and } \mathcal{Z} = z_{\bullet\bullet}$ (18)

or, in matrix form,

 $\boldsymbol{\Pi} = \dfrac{\mathbf{Z}}{z_{\bullet\bullet}}, \text{ with } \mathbf{Z} = (\mathbf{I}-\mathbf{W})^{-1}$ (19)

where $\boldsymbol{\Pi}$, called the bag-of-paths probability matrix, contains the probabilities $P(s=i,e=j)$ for each starting-ending pair of nodes. Note that this matrix is not symmetric in general; therefore, in the case of an undirected graph, we might instead compute the probability of drawing a path $i \rightsquigarrow j$ or $j \rightsquigarrow i$. The result is a symmetric matrix,

 $\boldsymbol{\Pi}^{\mathrm{sym}} = \boldsymbol{\Pi} + \boldsymbol{\Pi}^{\mathsf{T}}$ (20)

and only the upper (or lower) triangular part of the matrix is relevant.
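The whole computation of the bag-of-paths probability matrix can be sketched in a few lines; this is a minimal numerical illustration, with the 3-node path graph, unit costs, and $\theta = 1$ being arbitrary choices:

```python
import numpy as np

# Illustrative computation of the bag-of-paths probability matrix Pi
# (Equations (18)-(20)); graph, costs, and theta are arbitrary choices.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
C = A.copy()                                   # unit edge costs
theta = 1.0

P_ref = A / A.sum(axis=1, keepdims=True)       # reference transition matrix
W = P_ref * np.exp(-theta * C)                 # Equation (6)

Z = np.linalg.inv(np.eye(3) - W)               # fundamental matrix (Eq. (13))
Pi = Z / Z.sum()                               # Pi = Z / z.. (Eq. (19))
Pi_sym = Pi + Pi.T                             # symmetrized version (Eq. (20))

assert np.isclose(Pi.sum(), 1.0)               # a proper probability distribution
```

Note that a single matrix inversion yields all $n^2$ pairwise probabilities at once, which is the computational appeal of the closed form.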

#### 3.2.3 An intuitive interpretation of the zij

An intuitive interpretation of the elements of the matrix $\mathbf{Z}$ can be provided as follows Saerens-2008 (); Mantrach-2009 (). Consider a special random walk defined by the transition probability matrix $\mathbf{W}$ whose elements are $w_{ij} = p^{\mathrm{ref}}_{ij}\exp[-\theta c_{ij}]$. As $\mathbf{W}$ has some row sums less than one (the rows of $\mathbf{C}$ containing at least one strictly positive cost $c_{ij}$), the random walker has a nonzero probability of disappearing in each of these nodes, equal to $1 - \sum_{j=1}^{n} w_{ij}$, at each time step. Indeed, from Equation (6), it can be observed that the probability of surviving a transition $i \to j$ is proportional to $\exp[-\theta c_{ij}]$, which makes sense: there is a smaller probability of surviving edges with a high cost. In this case, the elements $z_{ij}$ of the matrix $\mathbf{Z}$ can be interpreted as the expected number of times that an “evaporating”, or “killed”, random walk starting from node $i$ visits node $j$ (see for instance Snell-1984 (); Kemeny-1960 ()) before being killed.
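The “expected number of visits” reading of $z_{ij}$ can be checked by direct simulation of the killed walk; the small graph, unit costs, $\theta = 1$, and sample size below are arbitrary illustration choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same illustrative 3-node path graph with unit costs and theta = 1.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
theta = 1.0
P_ref = A / A.sum(axis=1, keepdims=True)
W = P_ref * np.exp(-theta * A)                  # substochastic "killed" walk
Z = np.linalg.inv(np.eye(3) - W)                # z_ij = expected visits of j from i

def killed_walk_visits(start, n_walks=50_000):
    """Average number of visits to each node by the killed random walk."""
    visits = np.zeros(W.shape[0])
    for _ in range(n_walks):
        i = start
        while True:
            visits[i] += 1.0
            u = rng.random()
            cum = np.cumsum(W[i])
            # survive the step with probability sum_j w_ij, else be killed
            if u < cum[-1]:
                i = int(np.searchsorted(cum, u, side="right"))
            else:
                break
    return visits / n_walks

est = killed_walk_visits(0)
print(est, Z[0])                                # the two rows nearly agree
```

With enough walks, the Monte Carlo estimate of the visit counts from node 1 matches the first row of $\mathbf{Z}$ up to sampling noise.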

## 4 Working with hitting/absorbing paths: the bag of hitting paths

The bag-of-hitting-paths model described in this section is a restriction of the previously introduced bag-of-paths model in which the ending node of each path only appears once – at the end of the path. In other words, no intermediate node on a path is allowed to be its ending node $j$, thus prohibiting looping on this node $j$. Technically, this constraint will be enforced by making the ending node absorbing (and killing, see later), as in the case of an absorbing Markov chain Snell-1984 (); Isaacson-1976 (); Kemeny-1960 (); Norris-1997 (). We will see later in this section that this model has some nice properties.

### 4.1 Definition of the bag-of-hitting-paths probabilities

Let $\mathcal{P}^{\mathrm{h}}_{ij}$ be the set of hitting paths starting from node $i$ and stopping once node $j$ has been reached for the first time ($j$ is made absorbing). Let $\mathcal{P}^{\mathrm{h}}$ be the complete set of such hitting paths. Following the same reasoning as in the previous subsection, from Equation (9), when putting a Gibbs-Boltzmann distribution on $\mathcal{P}^{\mathrm{h}}$, the probability of drawing a hitting path starting in $i$ and ending in $j$ is

 $P^{\mathrm{h}}(s=i,e=j) = \dfrac{\sum_{\wp\in\mathcal{P}^{\mathrm{h}}_{ij}} \tilde{\pi}_{\mathrm{ref}}(\wp)\exp[-\theta\tilde{c}(\wp)]}{\sum_{\wp'\in\mathcal{P}^{\mathrm{h}}} \tilde{\pi}_{\mathrm{ref}}(\wp')\exp[-\theta\tilde{c}(\wp')]} = \dfrac{\sum_{\wp\in\mathcal{P}^{\mathrm{h}}_{ij}} \tilde{\pi}_{\mathrm{ref}}(\wp)\exp[-\theta\tilde{c}(\wp)]}{\mathcal{Z}^{\mathrm{h}}}$ (21)

and the denominator of this expression is also called the partition function, $\mathcal{Z}^{\mathrm{h}}$, for the hitting-paths system this time. The quantity $P^{\mathrm{h}}(s=i,e=j)$ will be called the bag-of-hitting-paths probability of drawing a hitting path starting in node $i$ and ending in node $j$. Note that in the case of unbounded hitting paths, the reference path probabilities can be simply defined from the path likelihoods if we assume a uniform reference probability for drawing the starting and ending nodes. With this definition, it is shown in A that the probability is properly normalized, i.e., $\sum_{i,j=1}^{n} P^{\mathrm{h}}(s=i,e=j) = 1$.

Obviously, for hitting paths, if we adopt the convention that zero-length paths are allowed, paths of length greater than 0 starting in node $j$ and ending in the same node $j$ are prohibited – in that case, the zero-length path is the only allowed path starting and ending in $j$, and we set its weight $\tilde{\pi}_{\mathrm{ref}}(\wp)\exp[-\theta\tilde{c}(\wp)]$ equal to 1.

Now, following the same reasoning as in the previous section, the numerator of Equation (21) is

 $\sum_{\wp\in\mathcal{P}^{\mathrm{h}}_{ij}} \tilde{\pi}_{\mathrm{ref}}(\wp)\exp[-\theta\tilde{c}(\wp)] = \mathbf{e}_i^{\mathsf{T}}\left(\sum_{t=0}^{\infty}(\mathbf{W}^{(-j)})^{t}\right)\mathbf{e}_j = \mathbf{e}_i^{\mathsf{T}}(\mathbf{I}-\mathbf{W}^{(-j)})^{-1}\mathbf{e}_j = \mathbf{e}_i^{\mathsf{T}}\mathbf{Z}^{(-j)}\mathbf{e}_j = z^{(-j)}_{ij}$ (22)

where $\mathbf{W}^{(-j)}$ is now the matrix $\mathbf{W}$ of Equation (6) whose $j$th row has been set to $\mathbf{0}^{\mathsf{T}}$ (node $j$ is absorbing and killing, meaning that the $j$th row of the transition matrix is equal to zero) and $\mathbf{Z}^{(-j)} = (\mathbf{I}-\mathbf{W}^{(-j)})^{-1}$. This means that when the random walker reaches node $j$, he immediately stops his walk there. This matrix is given by $\mathbf{W}^{(-j)} = \mathbf{W} - \mathbf{e}_j\mathbf{w}_j^{\mathsf{T}}$ with $\mathbf{w}_j$ being a column vector containing the $j$th row of $\mathbf{W}$.

### 4.2 Computation of the bag-of-hitting-paths probabilities

In B, it is shown from a bag-of-paths framework point of view that the elements of $\mathbf{Z}^{(-j)}$ can be computed simply and efficiently by

 $z^{(-j)}_{ij} = [\mathbf{Z}^{(-j)}]_{ij} = \dfrac{z_{ij}}{z_{jj}}$ (23)

which is a noteworthy result by itself. Note that this result has been re-derived in a more conventional, but also more tedious, way through the Sherman-Morrison formula by Kivimaki-2012 () in the context of computing randomized shortest paths dissimilarities in closed form.

Using this result, Equation (22) can be developed as

 $\sum_{\wp\in\mathcal{P}^{\mathrm{h}}_{ij}} \tilde{\pi}_{\mathrm{ref}}(\wp)\exp[-\theta\tilde{c}(\wp)] = z^{(-j)}_{ij} = \dfrac{z_{ij}}{z_{jj}} \triangleq z^{\mathrm{h}}_{ij}$ (24)

where we define the matrix containing the elements $z^{\mathrm{h}}_{ij}$ as $\mathbf{Z}^{\mathrm{h}}$ – the fundamental matrix for hitting paths. From Equation (24), this matrix can be computed as $\mathbf{Z}^{\mathrm{h}} = \mathbf{Z}\mathbf{D}_{\mathrm{h}}^{-1}$ with $\mathbf{D}_{\mathrm{h}} = \mathbf{Diag}(\mathbf{Z})$. Note that the diagonal elements of $\mathbf{Z}^{\mathrm{h}}$ are equal to 1, $z^{\mathrm{h}}_{jj} = 1$ (at the limit $\theta \to \infty$, only shortest paths, without loops, are considered).
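The closed form $z^{\mathrm{h}}_{ij} = z_{ij}/z_{jj}$ can be cross-checked against the direct definition via $\mathbf{W}^{(-j)}$; the toy graph, unit costs, and $\theta = 1$ below are arbitrary illustration choices:

```python
import numpy as np

# Illustrative check of Z_h = Z D_h^{-1}, i.e. z^h_ij = z_ij / z_jj
# (Equations (23)-(24)); graph, costs, and theta are arbitrary choices.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
theta = 1.0
P_ref = A / A.sum(axis=1, keepdims=True)
W = P_ref * np.exp(-theta * A)

Z = np.linalg.inv(np.eye(3) - W)
Zh = Z / np.diag(Z)                   # divides column j by z_jj (broadcasting)

# Cross-check Equation (23) directly: make node j absorbing and killing
# by zeroing row j of W, then invert.
j = 2
W_minus_j = W.copy()
W_minus_j[j, :] = 0.0
Z_minus_j = np.linalg.inv(np.eye(3) - W_minus_j)

assert np.allclose(Z_minus_j[:, j], Zh[:, j])  # z^{(-j)}_ij == z_ij / z_jj
assert np.allclose(np.diag(Zh), 1.0)           # diagonal elements equal 1
```

The point of the closed form is that a single inversion of $\mathbf{I}-\mathbf{W}$ replaces the $n$ separate inversions of $\mathbf{I}-\mathbf{W}^{(-j)}$, one per absorbing node.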

We immediately deduce the bag-of-hitting-paths probability including zero-length paths (Equation (21)),

 $P^{\mathrm{h}}(s=i,e=j) = \dfrac{\sum_{\wp\in\mathcal{P}^{\mathrm{h}}_{ij}} \tilde{\pi}_{\mathrm{ref}}(\wp)\exp[-\theta\tilde{c}(\wp)]}{\sum_{i',j'=1}^{n}\sum_{\wp'\in\mathcal{P}^{\mathrm{h}}_{i'j'}} \tilde{\pi}_{\mathrm{ref}}(\wp')\exp[-\theta\tilde{c}(\wp')]}$ (25)

where the denominator of Equation (25) is the partition function of the hitting paths model,

 $\mathcal{Z}^{\mathrm{h}} = \sum_{i,j=1}^{n}\sum_{\wp\in\mathcal{P}^{\mathrm{h}}_{ij}} \tilde{\pi}_{\mathrm{ref}}(\wp)\exp[-\theta\tilde{c}(\wp)] = \sum_{i,j=1}^{n} \dfrac{z_{ij}}{z_{jj}}$ (26)

In matrix form, denoting by $\boldsymbol{\Pi}^{\mathrm{h}}$ the matrix of bag-of-hitting-paths probabilities $P^{\mathrm{h}}(s=i,e=j)$,

 $\boldsymbol{\Pi}^{\mathrm{h}} = \dfrac{\mathbf{Z}\mathbf{D}_{\mathrm{h}}^{-1}}{\mathbf{e}^{\mathsf{T}}\mathbf{Z}\mathbf{D}_{\mathrm{h}}^{-1}\mathbf{e}}, \text{ with } \mathbf{Z} = (\mathbf{I}-\mathbf{W})^{-1} \text{ and } \mathbf{D}_{\mathrm{h}} = \mathbf{Diag}(\mathbf{Z})$ (27)

The algorithm for computing the matrix $\boldsymbol{\Pi}^{\mathrm{h}}$ is shown in Algorithm 1. The symmetric version for hitting paths is obtained by applying Equation (20) after the computation of $\boldsymbol{\Pi}^{\mathrm{h}}$. An interesting application would be to investigate graph cuts based on bag-of-hitting-paths probabilities instead of the standard adjacency matrix.
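Equation (27) can be sketched directly in matrix form (a minimal illustration standing in for Algorithm 1, which is not reproduced in this section; graph, costs, and $\theta$ are arbitrary choices):

```python
import numpy as np

# Illustrative computation of the bag-of-hitting-paths probability matrix
# Pi_h (Equation (27)); graph, costs, and theta are arbitrary choices.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
theta = 1.0
P_ref = A / A.sum(axis=1, keepdims=True)
W = P_ref * np.exp(-theta * A)

Z = np.linalg.inv(np.eye(3) - W)      # Z = (I - W)^{-1}
Zh = Z / np.diag(Z)                   # Z D_h^{-1} with D_h = Diag(Z)
Pi_h = Zh / Zh.sum()                  # normalize by e^T Z D_h^{-1} e

Pi_h_sym = Pi_h + Pi_h.T              # symmetrized version (Equation (20))

assert np.isclose(Pi_h.sum(), 1.0)
```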

### 4.3 An intuitive interpretation of the zhij

In this section, we provide an intuitive description of the elements $z^{\mathrm{h}}_{ij}$ of the hitting-paths fundamental matrix $\mathbf{Z}^{\mathrm{h}}$. Let us consider a particular killed random walk with absorbing state $\alpha$ on the graph whose transition probabilities are given by the elements of $\mathbf{W}$, that is, $w_{ij} = p^{\mathrm{ref}}_{ij}\exp[-\theta c_{ij}]$ when $i \neq \alpha$ and $w_{\alpha j} = 0$ otherwise. In other words, the node $\alpha$ is made absorbing and killing – it corresponds to hitting paths with node $\alpha$ as hitting node. When the walker reaches this node, he stops his walk and disappears. Moreover, as $w_{ij} \leq p^{\mathrm{ref}}_{ij}$ for all $i, j$, the matrix of transition probabilities is substochastic and the random walker also has a nonzero probability of disappearing at each step of its random walk in each node $i$ for which $\sum_{j=1}^{n} w_{ij} < 1$. This stochastic process has been called an “evaporating random walk” in Saerens-2008 () or an “exponentially killed random walk” in Steele-2001 ().

Now, let us consider column $\alpha$ (corresponding to the hitting, or absorbing, node) of the fundamental matrix of non-hitting paths, $\mathbf{z}_{\alpha} = \mathbf{Z}\mathbf{e}_{\alpha}$. Because the fundamental matrix is $\mathbf{Z} = (\mathbf{I}-\mathbf{W})^{-1}$ (Equation (13)), we easily obtain $\mathbf{z}_{\alpha} = \mathbf{W}\mathbf{z}_{\alpha} + \mathbf{e}_{\alpha}$. Or, in elementwise form,

 $\begin{cases} z_{i\alpha} = \sum_{j=1}^{n} w_{ij} z_{j\alpha} & \text{for each } i \neq \alpha \\ z_{\alpha\alpha} = \sum_{j=1}^{n} w_{\alpha j} z_{j\alpha} + 1 & \text{for absorbing node } \alpha \end{cases}$ (28)

When considering hitting paths instead, $w_{\alpha j} = 0$ for all $j$ (node $\alpha$ is made absorbing and killing), so that the second line of Equation (28) – the boundary condition – becomes simply $z_{\alpha\alpha} = 1$ for hitting paths. Moreover, we know that $z^{\mathrm{h}}_{i\alpha} = z_{i\alpha}/z_{\alpha\alpha}$ (see Equation (24)) for any $i$. Thus, dividing the first line of Equation (28) by $z_{\alpha\alpha}$ provides

 $\begin{cases} z^{\mathrm{h}}_{i\alpha} = \sum_{j=1}^{n} w_{ij} z^{\mathrm{h}}_{j\alpha} & \text{for each } i \neq \alpha \\ z^{\mathrm{h}}_{\alpha\alpha} = 1 & \text{for absorbing node } \alpha \end{cases}$ (29)

Interestingly, this is exactly the set of recurrence equations computing the probability of hitting node $\alpha$ when starting from node $i$ (see, e.g., Kemeny-1960 (); Ross-2000 (); Taylor-1998 ()). Therefore, the $z^{\mathrm{h}}_{i\alpha}$ represent the probability of surviving during the killed random walk from $i$ to $\alpha$ with transition probabilities $w_{ij}$ and node $\alpha$ made absorbing. Said differently, $z^{\mathrm{h}}_{i\alpha}$ corresponds to the probability of reaching absorbing node $\alpha$ without being killed during the walk.
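The recurrence (29) and the survival-probability interpretation can be verified numerically; the 3-node path graph, unit costs, $\theta = 1$, and the choice of absorbing node are arbitrary illustration choices:

```python
import numpy as np

# Illustrative check of the recurrence (29) and of the survival-probability
# interpretation of z^h_{i,alpha}; graph, costs, theta are arbitrary choices.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
theta = 1.0
P_ref = A / A.sum(axis=1, keepdims=True)
W = P_ref * np.exp(-theta * A)

Z = np.linalg.inv(np.eye(3) - W)
Zh = Z / np.diag(Z)

alpha = 2                                       # absorbing, killing node
for i in range(3):
    if i != alpha:
        # z^h_{i,alpha} = sum_j w_ij z^h_{j,alpha}  (first line of Eq. (29))
        assert np.isclose(Zh[i, alpha], W[i] @ Zh[:, alpha])
assert np.isclose(Zh[alpha, alpha], 1.0)        # boundary condition

# Being probabilities of surviving the walk, the z^h_{i,alpha} lie in (0, 1].
assert np.all((Zh[:, alpha] > 0) & (Zh[:, alpha] <= 1.0))
```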

## 5 Two novel families of distances based on hitting path probabilities

In this section, two families of distance measures are derived from the hitting-path probabilities including zero-length paths (the results do not hold for a bag of paths excluding zero-length paths). The second one benefits from some nice properties that will be detailed.

### 5.1 A first distance measure

The first distance measure is directly derived from the bag-of-paths probabilities introduced in the previous section.

#### 5.1.1 Definition of the distance

This section shows that the associated surprisal measure,

 $-\log P^{\mathrm{h}}(s=i,e=j),$

quantifying the “surprise” generated by the outcome $(s=i, e=j)$, when symmetrized, is a distance measure. This distance, associated with the bag of hitting paths, is defined as follows:

 $\Delta^{\mathrm{sur}}_{ij} \triangleq \begin{cases} -\dfrac{\log P^{\mathrm{h}}(s=i,e=j) + \log P^{\mathrm{h}}(s=j,e=i)}{2} & \text{if } i \neq j \\ 0 & \text{if } i = j \end{cases}$ (30)

where $P^{\mathrm{h}}(s=i,e=j)$ and $P^{\mathrm{h}}(s=j,e=i)$ are computed according to Equation (25), or (27) for the matrix form. Obviously, $\Delta^{\mathrm{sur}}_{ij} \geq 0$ and $\boldsymbol{\Delta}^{\mathrm{sur}}$ is symmetric. Moreover, $\Delta^{\mathrm{sur}}_{ij}$ is equal to zero if and only if $i = j$.

It is shown in C that this quantity is a distance measure since it satisfies the triangle inequality, in addition to the other mentioned properties. This distance will be called the bag-of-hitting-paths surprisal distance.

#### 5.1.2 Computation of the distance

It can be computed by adding the following matrix operations to Algorithm 1:

• take minus the elementwise logarithm of $\boldsymbol{\Pi}^{\mathrm{h}}$

• symmetrize the resulting matrix

• put the diagonal to zero
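These steps can be sketched as follows (a minimal numerical illustration; the 3-node path graph, unit costs, and $\theta = 1$ are arbitrary choices, not from the paper):

```python
import numpy as np

# Illustrative computation of the surprisal distance (Equation (30));
# graph, costs, and theta are arbitrary illustration choices.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
theta = 1.0
P_ref = A / A.sum(axis=1, keepdims=True)
W = P_ref * np.exp(-theta * A)

Z = np.linalg.inv(np.eye(3) - W)
Zh = Z / np.diag(Z)
Pi_h = Zh / Zh.sum()                       # bag-of-hitting-paths probabilities

L = -np.log(Pi_h)                          # surprisal of each outcome
D_sur = (L + L.T) / 2.0                    # symmetrize (Equation (30))
np.fill_diagonal(D_sur, 0.0)               # zero diagonal

# Basic distance properties: symmetry, non-negativity, triangle inequality.
assert np.allclose(D_sur, D_sur.T) and np.all(D_sur >= 0)
for i in range(3):
    for j in range(3):
        for k in range(3):
            assert D_sur[i, k] <= D_sur[i, j] + D_sur[j, k] + 1e-12
```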

We now turn to the development of the second distance measure.

### 5.2 A second distance measure

This subsection introduces a second measure enjoying some nice properties, based on the same ideas.

#### 5.2.1 Definition of the distance

The second distance measure automatically follows from Inequality (55) in C and is based on the quantities $z^{\mathrm{h}}_{ij}$. For convenience, let us recall this inequality,

 $P^{\mathrm{h}}(s=i,e=k) \geq \mathcal{Z}^{\mathrm{h}}\, P^{\mathrm{h}}(s=i,e=j)\, P^{\mathrm{h}}(s=j,e=k)$

Then, from $P^{\mathrm{h}}(s=i,e=j) = z^{\mathrm{h}}_{ij}/\mathcal{Z}^{\mathrm{h}}$ (Equation (25)), we directly obtain $z^{\mathrm{h}}_{ik} \geq z^{\mathrm{h}}_{ij}\, z^{\mathrm{h}}_{jk}$. Taking $-\frac{1}{\theta}\log$ of both sides provides $-\frac{1}{\theta}\log z^{\mathrm{h}}_{ik} \leq -\frac{1}{\theta}\log z^{\mathrm{h}}_{ij} - \frac{1}{\theta}\log z^{\mathrm{h}}_{jk}$, or,

 $\phi(i,k) \leq \phi(i,j) + \phi(j,k)$ (31)

where we defined

 $\phi(i,j) \triangleq -\dfrac{1}{\theta}\log z^{\mathrm{h}}_{ij} = -\dfrac{1}{\theta}\log\left(\dfrac{z_{ij}}{z_{jj}}\right)$ (32)

and, from (31), the quantities $\phi(i,j)$ obviously verify the triangle inequality.

The quantity $\phi(i,j)$ will be called the potential Cinlar-1975 () of node $i$ with respect to node $j$. Indeed, it has been shown Garcia-Diez-2011b () that when computing the continuous-state continuous-time equivalent of the randomized shortest paths framework Saerens-2008 (), $\phi$ plays the role of a potential inducing a drift (external force) in the corresponding diffusion equation. From the properties and the probabilistic interpretation of the $z^{\mathrm{h}}_{ij}$, both $\phi(i,j) \geq 0$ (as $z^{\mathrm{h}}_{ij} \leq 1$) and $\phi(i,i) = 0$ (as $z^{\mathrm{h}}_{ii} = 1$) hold.

This directed distance measure has three intuitive interpretations.

• First, let us recall from Equation (24) that $z^{\mathrm{h}}_{ij}$ is given by $z_{ij}/z_{jj}$, where $z_{ij}$ is element $i, j$ of the fundamental matrix $\mathbf{Z}$ (see Equation (13)). From this last expression, $\phi(i,j)$ can be interpreted (up to a scaling factor) as minus the logarithm of the expectation of the reward $\exp[-\theta\tilde{c}(\wp)]$ with respect to the path likelihoods, when considering absorbing random walks starting from node $i$ and ending in node $j$.

• In addition, from Equation (29), it also corresponds (up to the factor $1/\theta$) to minus the log-likelihood of surviving the killed, absorbing, random walk from $i$ to $j$.

• Finally, it was shown in Kivimaki-2012 (), investigating further developments of the randomized shortest paths (RSP) dissimilarity, that the potential distance also corresponds to the minimal free energy of the system of hitting paths from $i$ to $j$. Indeed, the RSP dissimilarity, defined as the expected total cost between $i$ and $j$, is not a distance measure as it does not satisfy the triangle inequality. However, subtracting the entropy from the expected total cost (that is, computing the free energy) leads to a distance measure that was shown to be equivalent to the potential distance. Therefore the potential distance was called the free energy distance in Kivimaki-2012 (), which provides still another interpretation to the potential distance.

Inequality (31) suggests to define the distance $\Delta^{\phi}_{ij} \triangleq \frac{1}{2}(\phi(i,j) + \phi(j,i))$ for $i \neq j$. It has all the properties of a distance measure, including the triangle inequality, which is verified thanks to Inequality (31). Note that this distance measure can be expressed as a function of the surprisal distance (see Equation (30)) as $\Delta^{\phi}_{ij} = \frac{1}{\theta}\left(\Delta^{\mathrm{sur}}_{ij} - \log\mathcal{Z}^{\mathrm{h}}\right)$ for $i \neq j$. This shows that the newly introduced distance is equivalent to the previous one, up to the addition of a constant and a rescaling.

The definition of the bag-of-hitting-paths potential distance is therefore

 $\Delta^{\phi}_{ij} \triangleq \begin{cases} \dfrac{\phi(i,j) + \phi(j,i)}{2} & \text{if } i \neq j \\ 0 & \text{if } i = j \end{cases}, \text{ where } \phi(i,j) = -\dfrac{1}{\theta}\log\left(\dfrac{z_{ij}}{z_{jj}}\right)$ (33)

and $z_{ij}$ is element $i, j$ of the fundamental matrix $\mathbf{Z}$ (see Equation (13)).

#### 5.2.2 Computation of the distance

From Equation (27), it can easily be seen that the matrix $\mathbf{Z}^{\mathrm{h}}$ containing the $z^{\mathrm{h}}_{ij}$ can be computed thanks to Algorithm 1 without the normalization steps 7 and 8. The distance matrix with elements $\Delta^{\phi}_{ij}$ is denoted as $\boldsymbol{\Delta}^{\phi}$ and can easily be obtained by adding the following matrix operations to Algorithm 1:

• take minus $\frac{1}{\theta}$ times the elementwise logarithm of $\mathbf{Z}^{\mathrm{h}}$ for computing the potentials

• symmetrize the matrix

• put the diagonal to zero
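These steps can be sketched as follows, together with a check of the large-$\theta$ limit discussed in the next subsection (a minimal numerical illustration; the 3-node path graph, unit costs, and the values of $\theta$ are arbitrary choices):

```python
import numpy as np

# Illustrative computation of the potential (free energy) distance
# (Equation (33)), and a check that it approaches the shortest path
# distance for large theta; the graph and theta values are arbitrary.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])

def potential_distance(theta):
    P_ref = A / A.sum(axis=1, keepdims=True)
    W = P_ref * np.exp(-theta * A)                 # unit costs on edges
    Z = np.linalg.inv(np.eye(len(A)) - W)
    Phi = -np.log(Z / np.diag(Z)) / theta          # phi(i,j) = -(1/theta) log z^h_ij
    D = (Phi + Phi.T) / 2.0                        # symmetrize
    np.fill_diagonal(D, 0.0)                       # zero diagonal
    return D

# For large theta, the distance between nodes 1 and 3 tends to the
# shortest path cost (two unit-cost hops).
print(potential_distance(1.0)[0, 2], potential_distance(100.0)[0, 2])
```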

Note that both the surprisal and the potential distances are well-defined as we assumed that $G$ is strongly connected.

### 5.3 Some properties of the potential and surprisal distances

The potential distance benefits from some interesting properties proved in the appendix:

• The potential distance is graph-geodetic, meaning that $\Delta^{\phi}_{ij} = \Delta^{\phi}_{ik} + \Delta^{\phi}_{kj}$ if and only if every path from $i$ to $j$ passes through $k$ Chebotarev-2011 () (see D for the proof).

• For an undirected graph $G$, the distance $\boldsymbol{\Delta}^{\phi}$ approaches the shortest path distance when $\theta$ becomes large, $\theta \to \infty$. In that case, Equation (33) reduces to the Bellman-Ford formula (see, e.g., Bertsekas-2000 (); Christofides_1975 (); Cormen-2009 ()) for computing the shortest path distance,