A class of randomized Subset Selection Methods for large complex networks

Amit Reza amit.reza@iitgn.ac.in Indian Institute of Technology Gandhinagar, Gujarat, India    Richa Tripathi richa.tripathi@iitgn.ac.in Indian Institute of Technology Gandhinagar, Gujarat, India
July 3, 2019
Abstract

Most real-world complex networks, such as the Internet, the World Wide Web and collaboration networks, are huge; to infer their structure and dynamics one needs to handle large connectivity (adjacency) matrices. Moreover, to find the spectra of these networks, one needs to perform an eigenvalue decomposition (or a singular value decomposition for bipartite networks) of these large adjacency matrices or of their Laplacian matrices. In the present work, we propose randomized versions of the existing heuristics to infer the norm and the spectrum of the adjacency matrices. In an earlier work Tripathi and Reza (2019), we used the Subset Selection (SS) procedure to obtain the critical network structure, which is smaller in size and retains the properties of the original network in terms of its Principal Singular Vector and eigenvalue spectra. We now present a few randomized versions of SS (RSS) and their time and space complexity, evaluated on various benchmark and real-world networks. We find that the RSS based on using a QR decomposition instead of the SVD in the deterministic SS is the fastest. We evaluate the correctness and the speed of these randomized SS heuristics on test networks and compare the results with the deterministic counterpart reported earlier. We find that the proposed methods can be used effectively in large and sparse networks, and, owing to their reduced time complexity, they can be extended to analyse important network structure in dynamically evolving networks.

preprint: APS/123-QED

I Introduction

Most real-world complex systems can be modelled and studied as complex networks. The spectra of these networks have a direct correlation with their topological properties and hence with the processes taking place on them. For example, the Internet network Albert et al. (1999) has a power-law distribution of eigenvalues with a tail at the larger eigenvalues. Its robustness to random failure of nodes and the absence of an epidemic threshold are attributed to its scale-free topology. As a consequence, the eigenvalue spectrum can be used as a fingerprint of a complex network. In many notable works in complex network theory, the spectra of the adjacency matrix (A) and the Laplacian (L) of complex networks Dorogovtsev et al. (2003), Rodgers et al. (2005) have been successfully employed to infer the synchronizability of complex networks Estrada and Hatano (2008), Sole-Ribalta et al. (2013), the partition of a network into modules or clusters Wang et al. (2008), and the controllability of epidemic spreading on networks Goltsev et al. (2012). Similarly, the eigenvectors have been used to identify the most influential network nodes Chen et al. (2012), Kitsak et al. (2010) and for spectral partitioning of networks into communities Newman (2006), White and Smyth (2005), Newman (2013).

Any algorithm that uses the full network topology runs in time polynomial in the network size. To counter this, numerous efforts have been made to represent large complex networks in a reduced form such that their spectra are preserved. This reducibility is possible because of the presence of unimportant network structure, such that the network function is robust to the shutdown or failure of those nodes and links. With a similar aim, we proposed applying the Subset Selection (SS) algorithm Golub and Van Loan (2012) to the adjacency matrix of a complex network Tripathi and Reza (2019) such that the obtained subset can be used to infer the spectra of the original network. This method is especially important for large complex networks, where the space and time complexity of analysing the full adjacency matrix is too expensive. Although we have used the SS method to infer reduced representations of complex networks, it has also been successfully used in a plethora of applications, ranging from solving rank-deficient least-squares problems Golub and Van Loan (2012) to genetics Butler et al. (2005), wireless communication Wilzeck and Kaiser (2008) and other information-retrieval problems. Here we propose randomized versions of the deterministic SS procedure and use them to infer the spectra of complex networks.

The randomized SS procedures that we propose have lower time and space complexity than their deterministic counterpart. The deterministic SS uses the first $k$ right singular vectors (obtained by performing a Singular Value Decomposition) of the matrix (the adjacency matrix $A$ for complex networks) and performs QR factorization with column pivoting on the matrix formed by these vectors to obtain a permutation, which is then used to reorder the columns of $A$ in decreasing order of preference. The first $k$ columns of the reordered $A$ form the subset. The randomized versions of SS first project the column vectors of the original matrix onto a lower-dimensional space using a random matrix Bingham and Mannila (2001). Following this, the SS procedure is performed on the matrix of projected vectors. The randomized version runs in approximately half the time required by the deterministic SS procedure. We also present a similar randomization-based algorithm that offers reduced space complexity. The detailed calculation of the time and space complexities of all methods is presented in the paper.

We show the results by comparing the singular value spectra obtained from the deterministic and randomized SS procedures among themselves and with the spectrum of the original network, for a predefined subset size $k$. We also present plots of the Principal Singular Vector (PSV) components of the subsets obtained with the deterministic and randomized procedures and report their cosine similarities with the PSV of the original adjacency matrix. The cosine similarities are high, showing that the PSVs of the subsets derived from the randomized versions of SS have a remarkable overlap with the PSV of $A$. This observation justifies that randomized subsets can competently represent the original network as far as its spectrum and PSV are concerned. Moreover, the randomized versions select the same nodes as important or influential as those chosen by the deterministic SS procedure. Hence the randomized SS offers the same benefits as the deterministic SS on complex networks in less time. Without loss of generality, the randomized SS algorithms can be applied to any dataset in matrix form, not only to complex network adjacency matrices.

II Background

II.1 Mathematical preliminaries

  • Rank: For a matrix with $m$ rows and $n$ columns, the rank $r$ satisfies $r \le \min(m, n)$. If there is collinearity in the dataset (matrix), the rank is the number of linearly independent row vectors or the number of linearly independent column vectors, whichever is smaller.

  • Energy of a Matrix: Also given by the Frobenius norm, the energy of a matrix is the square root of the sum of the absolute squares of all its elements. It is also given by the square root of the sum of the squares of the singular values $\sigma_i$ of the matrix,

    $\|A\|_F = \sqrt{\sum_{i,j} |a_{ij}|^2} = \sqrt{\sum_{i} \sigma_i^2}.$    (1)
  • Singular Value Decomposition (SVD): The Singular Value Decomposition is a matrix factorization technique and is a generalization of the eigendecomposition to non-square matrices. For a matrix $A$, the SVD is given by $A = U \Sigma V^T$, where $U$ and $V$ are orthogonal matrices whose columns are the left and the right singular vectors of $A$, and $\Sigma$ is a diagonal matrix with the singular values along its diagonal.

  • QR: The QR decomposition is a matrix factorization technique which decomposes a matrix $A$ as $A = QR$, where $Q$ is an orthogonal matrix and $R$ is an upper triangular matrix.

  • QR-cp: The QR factorization of $A$ with column pivoting is given by

    $A \Pi = Q R,$    (2)

    where $Q$ and $R$ are as before and $\Pi$ is a permutation matrix chosen such that the magnitudes of the diagonal entries of $R$ are non-increasing,

    $|r_{11}| \ge |r_{22}| \ge \dots \ge |r_{nn}|,$    (3)

    and, for each $k$ and every $j > k$, $r_{kk}^2 \ge \sum_{i=k}^{j} r_{ij}^2$.

  • Random Projection: It is a method used to reduce the dimensionality of data points lying in a Euclidean space. RP is generally used in handling and manipulating large datasets or manifolds to infer their intrinsic dimension or to know the principal directions.

  • Johnson–Lindenstrauss lemma: The JL lemma states that a dataset can be projected to a dimension much lower than the original such that the distances between the points in the original and the projected space are preserved. It is used extensively in problems of compressed sensing, dimensionality reduction and graph embedding. (A small numerical illustration is given after this list.)
    For any $0 < \epsilon < 1$, a set $X$ of $n$ points in $d$-dimensional space, and a number $l > 8 \ln(n) / \epsilon^2$, there exists a linear map $f: \mathbb{R}^d \rightarrow \mathbb{R}^{l}$ such that

    $(1-\epsilon)\,\|u - v\|^2 \;\le\; \|f(u) - f(v)\|^2 \;\le\; (1+\epsilon)\,\|u - v\|^2$    (4)

    for all $u, v \in X$.

  • Subset: A subset is the part of the original matrix containing only the significant columns, such that it has approximately the same Frobenius norm as the original matrix. The subset of a matrix exists if the singular value spectrum of the dataset is such that most of the matrix norm is made up of only a few singular values. The number of these dominant singular values is governed by the numerical rank $r$ of the matrix.

  • Centrality: In complex network theory, the centrality of a node or an edge is an indicator of its importance in the network. Depending on the ways one defines importance there are many centrality measures regularly used such as degree centrality, betweenness centrality, eigenvector centrality, closeness centrality, PageRank centrality, etc.
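As an illustration of the random-projection idea used throughout this paper, the short Python sketch below projects a set of points with a scaled Gaussian matrix and checks that pairwise distances are approximately preserved, as the JL lemma guarantees. It is a toy example with arbitrary sizes and variable names of our choosing, not the computation behind any figure in the paper.

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
m, n, l = 2000, 100, 200            # original dimension, number of points, projected dimension (arbitrary)
X = rng.standard_normal((m, n))     # columns of X are the n data points in R^m
Omega = rng.standard_normal((l, m)) / np.sqrt(l)   # scaled Gaussian random projection matrix
Y = Omega @ X                       # projected points in R^l

d_high = pdist(X.T)                 # pairwise distances between columns in the original space
d_low = pdist(Y.T)                  # pairwise distances after projection
ratio = d_low / d_high
print(ratio.mean(), ratio.std())    # mean close to 1 with a small spread: distances are preserved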

II.2 Subset selection procedure

The subset selection Kanjilal and Banerjee (1995) procedure is a well-known method to identify the essential column vectors of a data matrix $A$. In our previous work, we showed the application of the subset selection procedure to large complex networks Tripathi and Reza (2019). The basic idea of the deterministic subset selection (SS) method is to discard the redundant columns of the data matrix and keep those columns which have the maximum contribution to the matrix in terms of energy preservation. Mathematically, the data matrix $A$ can be thought of as a collection of two blocks, $A_1$ and $A_2$, where $A_1$ contains $k$ linearly independent columns which can approximately span the entire column space of $A$, and $A_2$, the collection of the redundant columns, can be represented by linear combinations of the column vectors of $A_1$.

It is understandable that the value of $k$ depends on the number of redundant columns. If the number of redundant columns is high, then $k$ will be decidedly small, and vice versa. Therefore the value of $k$ can be directly related to the numerical rank of the data matrix $A$. Hence, for a large data matrix, the size of the obtained subset depends on the rank deficiency of the data matrix, as it directly translates into the value of $k$. To find the essential (non-redundant) block $A_1$ and the redundant block $A_2$ of $A$, one has to compute the permutation matrix $P$ which helps to identify and separate out the two blocks as follows.

$A P = \begin{bmatrix} A_1 & A_2 \end{bmatrix}, \qquad A_1 \in \mathbb{R}^{m \times k}, \; A_2 \in \mathbb{R}^{m \times (n-k)}.$    (5)

Therefore the whole SS procedure Kanjilal and Banerjee (1995) boils down to obtaining the optimal permutation matrix $P$. One can end up with different realisations of $P$, but the optimal one is decided based on the following constraints.

  1. The number of linearly independent columns ($k$) of $A_1$ should represent the numerical rank of the data matrix $A$.

  2. The residual, i.e. the error in representing $A_2$ by linear combinations of the columns of $A_1$, should be minimal.

These constraints can be fulfilled by studying the singular value decomposition (SVD) of $A$. SVD Golub and Van Loan (2012) is one of the best numerical methods to obtain the critical basis of an arbitrary dataset. The singular value spectrum provides the corresponding weights of the basis vectors; therefore it can be used to find the value of $k$ as the numerical rank (normally decided based on the top-$k$ non-zero singular values).

Therefore the first step of the deterministic SS procedure is to obtain the factors $U$, $\Sigma$ and $V$ using SVD,

$A = U \Sigma V^T.$    (6)

The matrix $U$ contains the left singular vectors of $A$; the matrix $\Sigma$ is a diagonal matrix with $r$ ($r \le \min(m, n)$) positive non-zero entries (known as singular values) arranged in descending order of magnitude, i.e. $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0$; and the matrix $V$ contains the right singular vectors of $A$.

To find the value of $k$, we need to maximize the preservation of the Frobenius norm in such a way that

$\sum_{i=1}^{k} \sigma_i^2 \;\approx\; \|A\|_F^2,$    (7)

where $\|A\|_F$ is the Frobenius norm of $A$.

From Eqs. 5 and 6, it is clear that $A_1$ can be obtained by permuting the first $k$ columns of $A$. This implies that finding the $k$ important columns of $A$ can be equivalently translated into the problem of finding the optimal permutation of the corresponding columns of $V^T$. Hence the second essential step of the deterministic SS procedure is to obtain $P$ based on the truncated right singular matrix $V_k$. The complete deterministic subset selection procedure is described in Algo-1.

Input: Data Matrix {$A \in \mathbb{R}^{m \times n}$}, subset size $k$.
Output: $A_1 \in \mathbb{R}^{m \times k}$
  $[U, \Sigma, V] = \mathrm{svd}(A)$   // Compute the singular value decomposition
  $V_k = V(:, 1{:}k)$   // Choose first $k$ columns of $V$.
  $[Q, R, P] = \mathrm{qrcp}(V_k^T)$   // Compute column-pivotal QR decomposition.
  $A_P = A P$   // Permute the columns of the data matrix using permutation matrix $P$.
  $A_1 = A_P(:, 1{:}k)$   // Choose first $k$ columns.
Algorithm 1 Deterministic subset selection algorithm
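A minimal Python sketch of Algo-1 is given below, assuming a dense data matrix and NumPy/SciPy; it is an illustrative rendering of the steps above (the function name is ours), not the authors' implementation.

import numpy as np
from scipy.linalg import qr

def deterministic_ss(A, k):
    """Deterministic subset selection: return the indices of the k selected columns and the subset."""
    # Steps 1-2: SVD of A and the top-k right singular vectors (rows of Vt)
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk_t = Vt[:k, :]                      # k x n matrix V_k^T
    # Step 3: column-pivoted QR on V_k^T yields the column permutation of A
    _, _, piv = qr(Vk_t, pivoting=True)
    # Steps 4-5: the first k pivots give the selected columns of A
    cols = piv[:k]
    return cols, A[:, cols]

For a network, A is the (square, symmetric) adjacency matrix and cols indexes the selected nodes.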

It is notable that one can apply the deterministic SS procedure to a system represented by a complex network. For that purpose one has to compute the SVD of the network's adjacency matrix $A$ and follow the steps described in Algo-1 to obtain $A_1$, where $A_1$ is the matrix of important columns of the original network adjacency matrix $A$.

II.3 Bottleneck of the deterministic SS procedure

It is clear from Algo-1 that two essential numerical steps are involved in the deterministic SS procedure. The first is the computation of the SVD of the data matrix to obtain the set of basis vectors, and the second is to apply column-pivoted QR decomposition to the truncated right singular basis to get the optimal permutation matrix. For a large data matrix, the computation of the SVD is a practical issue. Generally, for a data matrix $A \in \mathbb{R}^{m \times n}$ (with $m \ge n$), the theoretical time complexity of the SVD is $O(m n^2)$, so the cost grows rapidly with the size of the data matrix. Moreover, the first step in Algo-1 computes the full set of right singular vectors and then rejects the less important ones; therefore, there is a waste of computational resources. For large datasets where the expected value of $k$ is small, this waste increases tremendously, and for large sparse complex networks this is quite often the scenario. Therefore, the deterministic SS procedure takes a long pre-processing time to provide the SVD factors. In this work, we investigate possible ways to reduce this pre-processing time by introducing random-projection (RP) based schemes to obtain the factors quickly, and we propose a class of randomized SS procedures which can be used to remove this bottleneck. The second step (the column-pivoted QR in Algo-1) is essential to obtain $P$; therefore, for both the deterministic and the randomized SS procedures, this step is required.

Figure 1: The histograms in panels (a)-(f) show the distribution of the ratio of pairwise distances between vectors in the original high-dimensional space and the vectors in the low-dimensional projected space. The distributions in these plots, centred around a mean ratio of one, show that the distances are preserved in the projected space, as required for the Random Projection method. Each panel corresponds to a projected dimension of $l = n/2$. The panels stand for (a) the Barabasi-Albert network, (b) the Drosophila network, (c) the Erdos-Renyi network, (d) the Friendship network, (e) the Power Grid network and (f) the US Air network, respectively.

III Randomized subset selection scheme

The deterministic SVD is a computationally expensive method for large datasets. Furthermore, it is challenging to parallelize the standard SVD technique to exploit advanced computer architectures. Recently developed randomized algorithms based on RP Halko et al. (2011) for low-rank approximation are computationally efficient, accurate up to a certain precision, and robust. Therefore they outperform traditional matrix factorization schemes in many practical problems Halko et al. (2011), Halko et al. (2009). The randomized approach is powerful for the following reasons.

  • The scheme is computationally efficient.

  • The main operations (steps) involved can be optimised on modern computational architectures.

The key concept of the randomized scheme is to exploit randomness to construct a surrogate lower-dimensional data matrix $B$ which captures the maximum information of the high-dimensional input data matrix $A$. The underlying assumption is that the input data matrix has a low-rank structure, i.e., its numerical rank is smaller than its original dimensions. Without loss of generality, we assume that $m \ge n$; to transform the original data matrix $A \in \mathbb{R}^{m \times n}$ into a surrogate data matrix $B \in \mathbb{R}^{l \times n}$, one has to generate a random test matrix $\Omega \in \mathbb{R}^{l \times m}$ and operate on $A$ in such a way that

$B = \Omega A.$    (8)

Essentially, the operation described in Eq. 8 is known as a RP of the original data vectors to a lower-dimensional space. Operating $\Omega$ on $A$ compresses the column vectors of $A$ and projects them to the lower-dimensional space $\mathbb{R}^{l}$: the column vectors are transformed from $m$-dimensional to $l$-dimensional vectors. Since $m$ is very large and $l$ is comparatively small, this transformation ($A \rightarrow B$) helps to reduce the computational cost, as one can directly use the transformed vectors for further processing. The point of concern is: how robust is the transformation? The RP can preserve the distance between each pair of column vectors in the lower dimension within an $\epsilon$-error; the J-L lemma ensures this. Hence, the surrogate matrix $B$ can be used to obtain the approximated first $k$ right singular vectors ($\tilde{V}_k$). Different approaches are possible to obtain this; Algorithms 2-5 represent a class of such randomized versions.

We show the efficacy of the RP method on complex network data. We computed the pairwise distances of the column vectors in the original and the lower-dimensional space for six different real-world and model network adjacency matrices (Drosophila, Friendship, Power-Grid, US-Air, Barabasi-Albert and Erdos-Renyi networks). The details (number of nodes and edges) of the data are given in TABLE III. We consider a projected dimension of $l = n/2$ for the complex network data. Figure 1 depicts the ratio of the pairwise distances of the vectors between the lower-dimensional and the original feature space. If the pairwise distances are preserved, then this ratio should be one; therefore the mean of the distribution of the ratios is expected to be one, and the variance varies depending on the dimension of the projected space and the number of projected vectors. Figure 1 shows that for the six different complex network datasets the mean of the distribution is indeed close to one. Therefore, the projection can preserve the pairwise distances of the column vectors in the lower-dimensional space with high accuracy. Hence, for these data, randomized SS is applicable and the retrieval of the right subspace is possible. One should also note that for complex network data we apply RP to an adjacency matrix, which is symmetric; therefore compressing the column vectors or compressing the row vectors gives the same result. Also, although in all the algorithms we denote the input matrix as a rectangular matrix $A \in \mathbb{R}^{m \times n}$, for complex network data it is a symmetric square matrix.

The projection to the low dimension is mediated by the product of a random matrix with $A$. The low-dimensional projection preserves the column norms, and the computation of any decomposition (SVD/QR) of the matrix formed by these transformed columns is computationally cheaper than performing the decomposition in the original higher-dimensional space. We employ this trait of RP in all the proposed randomized SS procedures. In the following subsections, we present the randomized subset selection versions in detail.

III.1 Randomized Subset Selection 1

To reduce the computational complexity of the deterministic subset selection process, we propose the use of random projection to obtain the top-$k$ right singular vectors. The original matrix $A$, which can be conceived as $n$ column vectors in $m$-dimensional space, is projected onto a low-dimensional space such that it becomes $B$. In the lower dimension, there are $n$ column vectors in $l$-dimensional space, obtained as follows,

$B = \Omega A,$    (9)

where $\Omega \in \mathbb{R}^{l \times m}$ is the random projection matrix. Now one can compute the SVD of $B$ to obtain approximations of the right singular vectors of $A$. Let the SVD factors of $B$ be as follows,

$B = \tilde{U} \tilde{\Sigma} \tilde{V}^T.$    (10)

The dimension of $\tilde{V}$ is $n \times l$. Theoretically, one can think of

$B^T B = A^T \Omega^T \Omega A.$    (11)

Again,

$\Omega^T \Omega \approx I$ (up to an overall scaling),    (12)

where we have used the inherent property of the random projection matrix that its columns are almost linearly independent, as they are drawn from a standard Gaussian distribution.

Using the fact from Eq. 12, one can rewrite Eq. 11 as

$B^T B \approx A^T A = V \Sigma^2 V^T.$    (13)

Therefore one can approximate $V_k$ by $\tilde{V}_k$ without applying SVD to the original data matrix $A$; the surrogate $B$ can be used to obtain it. It is notable that only the top-$k$ right singular vectors of $A$ can be approximated with high accuracy by $\tilde{V}$; the other right singular vectors cannot be retrieved. Hence this scheme reduces the time required to compute the SVD, as the row dimension is reduced from $m$ to $l$. The rest of the steps are the same as in the deterministic SS. The algorithm for the procedure is given in Algorithm-2.

Input: Data Matrix {$A \in \mathbb{R}^{m \times n}$}, Random Projection Matrix {$\Omega \in \mathbb{R}^{l \times m}$}
Output: $A_1 \in \mathbb{R}^{m \times k}$
  $B = \Omega A$   // Compression of the row-space
  $[\tilde{U}, \tilde{\Sigma}, \tilde{V}] = \mathrm{svd}(B)$   // Compute the singular value decomposition
  $[Q, R, P] = \mathrm{qrcp}(\tilde{V}_k^T)$   // Compute column-pivotal QR decomposition.
  $A_P = A P$   // Permute the columns of the data matrix using permutation matrix $P$.
  $A_1 = A_P(:, 1{:}k)$   // Choose first $k$ columns.
Algorithm 2 Randomized subset selection algorithm-I
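A minimal Python sketch of Algorithm-2 follows, assuming a standard Gaussian test matrix and $k \le l \ll m$; as before, it is an illustrative rendering with our own function name, not the authors' implementation.

import numpy as np
from scipy.linalg import qr

def randomized_ss_1(A, k, l, seed=0):
    """Randomized subset selection I: SVD of the small surrogate B = Omega A."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((l, m))                # random test matrix (Eq. 8)
    B = Omega @ A                                      # l x n surrogate matrix
    _, _, Vt = np.linalg.svd(B, full_matrices=False)   # SVD of the small matrix (Eq. 10)
    _, _, piv = qr(Vt[:k, :], pivoting=True)           # column-pivoted QR on the approximate V_k^T
    cols = piv[:k]
    return cols, A[:, cols]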

For a system represented by a complex network, if the network is very large, one has to deal with a large adjacency matrix $A$, and a full SVD of $A$ would need to be performed to find its right singular vectors $V$. The randomized SS prescribed above can be used to reduce the time complexity and obtain the SVD factors promptly.

III.2 Randomized Subset Selection 2

Step 2 of the previously prescribed randomized SS scheme (Algorithm-2) involves the SVD of a matrix of size $l \times n$; hence the time complexity of computing the SVD will be $O(n l^2)$, whereas a Householder-transformation based QR decomposition Golub and Van Loan (2012) can reduce the time complexity further. In this subsection, we propose a new randomized SS scheme based on the QR decomposition. Suppose the QR decomposition of $B$ can be written as

$B = Q R.$    (14)

Here we use a partial (economy) QR decomposition, as $B$ in general has full row rank. Previously we have shown that the top-$k$ right singular vectors of $A$ can be approximated by $\tilde{V}_k$. Therefore, our aim will be to obtain $\tilde{V}$ from the QR factors.
Let the SVD factors of $R$ be written as

$R = U_R \Sigma_R V_R^T.$    (15)

Combining Eqs. 14 and 15, we can rewrite

$B = Q U_R \Sigma_R V_R^T.$    (16)

Comparing Eq. 16 with Eq. 10, it is clear that $\tilde{U} = Q U_R$, $\tilde{\Sigma} = \Sigma_R$ and $\tilde{V} = V_R$. Hence one can compute $R$, and its SVD can be used to approximate $\tilde{V}$, which eventually approximates $V$. These steps are written in Algorithm-3. This algorithm is not computationally efficient, as it involves one QR decomposition (step 2) and an additional SVD factorization (step 3) of a matrix of size $l \times n$; the previously described Algorithm-2 is computationally more acceptable in comparison. It is further possible to discard step 3 by slightly modifying the algorithm, which is described in the next section as Algorithm-4.

Input: Data Matrix {$A \in \mathbb{R}^{m \times n}$}, Random Projection Matrix {$\Omega \in \mathbb{R}^{l \times m}$}
Output: $A_1 \in \mathbb{R}^{m \times k}$
  $B = \Omega A$   // Compression of the row-space
  $[Q, R] = \mathrm{qr}(B)$   // Compute the QR decomposition of $B$
  $[U_R, \Sigma_R, V_R] = \mathrm{svd}(R)$   // Compute the singular value decomposition
  $[Q', R', P] = \mathrm{qrcp}(V_{R,k}^T)$   // Compute column-pivotal QR decomposition.
  $A_P = A P$   // Permute the columns of the data matrix using permutation matrix $P$.
  $A_1 = A_P(:, 1{:}k)$   // Choose first $k$ columns.
Algorithm 3 Randomized subset selection algorithm-II
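A corresponding Python sketch (same assumptions and caveats as the previous sketches) is shown below: QR of the surrogate $B$, then an SVD of its small triangular factor $R$, whose right singular vectors approximate those of $A$.

import numpy as np
from scipy.linalg import qr

def randomized_ss_2(A, k, l, seed=0):
    """Randomized subset selection II: QR of B, then SVD of the small triangular factor R."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    B = rng.standard_normal((l, m)) @ A                 # l x n surrogate matrix
    Q, R = np.linalg.qr(B)                              # economy QR: R is l x n (Eq. 14)
    _, _, VRt = np.linalg.svd(R, full_matrices=False)   # SVD of R (Eq. 15)
    _, _, piv = qr(VRt[:k, :], pivoting=True)           # column-pivoted QR on the approximate V_k^T
    cols = piv[:k]
    return cols, A[:, cols]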

III.3 Randomized Subset Selection 3

Another advancement in terms of time-complexity reduction is to discard step 3 of the previously described algorithm (Algorithm-3). From Eq. 9, it is clear that

$B = \Omega A = (\Omega U)\, \Sigma V^T.$    (17)

Therefore, if one computes the QR decomposition of $B$ as

$B = Q R,$    (18)

then from Eqs. 17 and 18 it is easy to relate

$R = Q^T B = (Q^T \Omega U)\, \Sigma V^T,$    (19)

which shows that the rows of $R$ lie in the span of the right singular vectors of $A$; hence the column-pivoted QR can be applied directly to $R$ in place of $\tilde{V}_k^T$.

This scheme is more efficient than the two previously described randomized algorithms (Algorithms 2 and 3), as it involves only a QR decomposition (step 2) of a matrix of size $l \times n$.

Input: Data Matrix {$A \in \mathbb{R}^{m \times n}$}, Random Projection Matrix {$\Omega \in \mathbb{R}^{l \times m}$}
Output: $A_1 \in \mathbb{R}^{m \times k}$
  $B = \Omega A$   // Compression of the row-space
  $[Q, R] = \mathrm{qr}(B)$   // Compute the QR decomposition of $B$
  $[Q', R', P] = \mathrm{qrcp}(R)$   // Compute column-pivotal QR decomposition.
  $A_P = A P$   // Permute the columns of the data matrix using permutation matrix $P$.
  $A_1 = A_P(:, 1{:}k)$   // Choose first $k$ columns.
Algorithm 4 Randomized subset selection algorithm-III
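A Python sketch of this variant follows. It reads Algorithm-4 as applying the column-pivoted QR to the triangular factor $R$ of the surrogate $B$ (which gives the same pivots as pivoting $B$ itself, since an orthogonal left factor does not change column norms); same assumptions as before, and the reading of the pivoted step is ours.

import numpy as np
from scipy.linalg import qr

def randomized_ss_3(A, k, l, seed=0):
    """Randomized subset selection III: no SVD at all, only QR factorizations."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    B = rng.standard_normal((l, m)) @ A    # l x n surrogate matrix
    _, R = np.linalg.qr(B)                 # plain QR of the surrogate (Eq. 18)
    _, _, piv = qr(R, pivoting=True)       # column-pivoted QR on R (Eq. 19)
    cols = piv[:k]
    return cols, A[:, cols]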

III.4 Randomized Subset Selection 4

This section describes the fourth randomized version of the subset selection algorithm. The previous three versions (Algorithms 2, 3 and 4) have used an SVD or QR factorization of a matrix of size $l \times n$. But if $n$ is also very large, the time complexity will not be reduced to the desired level. Therefore, we are interested in transforming the data matrix to a much smaller size while still retrieving the first $k$ right singular vectors approximately. The algorithm prescribed in this section is based on constructing a matrix of size $l \times l$ and approximates the right singular vectors correctly. The transformation to this small matrix is logical and can be explained through simple algebraic relations. The first transformation ($A \rightarrow B$), as described in the previous sections, showed that the right singular vectors of $A$ can be approximated by the right singular vectors of $B$. In this subsection, we discuss the second transformation ($B \rightarrow B B^T$) and the retrieval of the right singular vectors of $B$ from $B B^T$. From Eqs. 9 and 10, it is trivial to obtain

$B B^T = \tilde{U} \tilde{\Sigma}^2 \tilde{U}^T.$    (20)

The above Eq. 20 shows that $B B^T$ preserves the information about the left singular vectors of $B$, and it reflects an eigenvalue relation of $B B^T$:

$(B B^T)\, \tilde{U} = \tilde{U}\, \tilde{\Sigma}^2.$    (21)

Therefore, one can compute the eigenvalue decomposition of $B B^T$ and obtain the approximated left singular vectors and singular values of $B$. Suppose the eigenvalue decomposition of $B B^T$ is

$B B^T = W \Lambda W^T.$    (22)

Therefore, comparing Eqs. 21 and 22, we get $\tilde{U} = W$ and $\tilde{\Sigma} = \Lambda^{1/2}$.

The right singular vectors of $B$ can then be obtained as

$\tilde{V} = B^T \tilde{U}\, \tilde{\Sigma}^{-1} = B^T W \Lambda^{-1/2}.$    (23)
Input: Data Matrix {$A \in \mathbb{R}^{m \times n}$}, Random Projection Matrix {$\Omega \in \mathbb{R}^{l \times m}$}
Output: $A_1 \in \mathbb{R}^{m \times k}$
  $B = \Omega A$   // Compression of the row-space
  $[W, \Lambda] = \mathrm{eig}(B B^T)$   // Eigenvalue decomposition of $B B^T$
  $\tilde{V} = B^T W \Lambda^{-1/2}$   // Approximate right singular vectors (Eq. 23)
  $[Q, R, P] = \mathrm{qrcp}(\tilde{V}_k^T)$   // Compute column-pivotal QR decomposition.
  $A_P = A P$   // Permute the columns of the data matrix using permutation matrix $P$.
  $A_1 = A_P(:, 1{:}k)$   // Choose first $k$ columns.
Algorithm 5 Randomized subset selection algorithm-IV
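A Python sketch of Algorithm-5 under the same assumptions (and additionally assuming that the top-$k$ eigenvalues of $B B^T$ are strictly positive) is:

import numpy as np
from scipy.linalg import qr

def randomized_ss_4(A, k, l, seed=0):
    """Randomized subset selection IV: eigendecomposition of the small l x l matrix B B^T."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    B = rng.standard_normal((l, m)) @ A            # l x n surrogate matrix
    lam, W = np.linalg.eigh(B @ B.T)               # eigenpairs of the l x l matrix (Eq. 22), ascending order
    top = np.argsort(lam)[::-1][:k]                # indices of the k largest eigenvalues
    Vk = (B.T @ W[:, top]) / np.sqrt(lam[top])     # approximate right singular vectors (Eq. 23)
    _, _, piv = qr(Vk.T, pivoting=True)            # column-pivoted QR on V_k^T
    cols = piv[:k]
    return cols, A[:, cols]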
Figure 2: (a) Singular value spectra of the US Air network (with 332 nodes and 2126 edges) and of the subsets (with $k = n/2$) extracted through the various algorithms proposed in the paper. (b) Principal Singular Vectors of the US Air network adjacency matrix and of the subsets extracted through the various algorithms. The excellent overlap between the PSVs of the main network and of the subsets shows that the subsets retain the spectral information of the main network.

III.5 Computational complexity analysis

To understand the computational complexity of the proposed Algorithms 2-5 in terms of the number of floating-point operations, we evaluate the complexity of the required steps. We mentioned earlier that the computation of the column-pivoted QR decomposition of the truncated right singular matrix is an essential step for all the randomized and deterministic SS methods; therefore, in the computational complexity analysis we exclude that cost. All randomized algorithms involve a multiplication of the data matrix with a random projection matrix, which requires $O(mnl)$ operations. Apart from that, Algorithm 2 involves an SVD factorization of a matrix of size $l \times n$, which costs $O(n l^2)$. Similarly, Algorithms 3 and 4 involve a QR factorization of an $l \times n$ matrix, which costs $O(n l^2)$ considering the Householder transformation. In contrast, Algorithm 5 involves an eigenvalue decomposition (step 3) of an $l \times l$ matrix, and therefore its time complexity is $O(l^3)$, which is very cheap for a low-rank structure in comparison to the other schemes.

Algorithm   Matrix multiplication   SVD/QR
Algorithm 1   -   $O(m n^2)$
Algorithm 2   $O(mnl)$   $O(n l^2)$
Algorithm 3   $O(mnl)$   $O(n l^2)$
Algorithm 4   $O(mnl)$   $O(n l^2)$
Algorithm 5   $O(mnl + n l^2)$   $O(l^3)$
Table 1: This table shows the theoretical time complexity involved in each of the algorithms proposed in the paper.
Algorithm BA Drosophila ER Friendship Power Grid US Air
Algorithm 1 1.014 4.895 1.148 5.165 87.408 0.060
Algorithm 2 0.371 1.761 0.380 1.978 32.546 0.034
Algorithm 3 0.392 2.097 0.434 2.306 37.424 0.036
Algorithm 4 1.109 4.116 1.185 4.364 56.132 0.057
Algorithm 5 0.281 1.268 0.252 1.441 55.639 0.044
Table 2: The table shows the measured time taken by the subset selection procedure for each of the algorithms proposed in the paper, for six networks.

TABLE I shows the details of the computational complexity of all the described algorithms. Algorithm 5 is more efficient than the other proposed randomized SS schemes in terms of practical applicability. This scheme is highly parallelizable and suited to modern computer architectures. It involves only one eigenvalue decomposition (step 3) of a matrix of dimension $l \times l$. Due to the low-rank structure, the value of $l$ will be very small for a large data matrix, so this eigenvalue operation can easily be done on a single machine, whereas the other randomized algorithms involve an SVD/QR factorization of an $l \times n$ matrix; for a large $n$, a single machine cannot perform this operation.

If we consider the total time complexity of each prescribed algorithm, then Algorithm 4 is more cost-efficient, as Algorithm 5 involves an additional large matrix multiplication with a cost of $O(n l^2)$. However, all the prescribed algorithms are intended for large data matrices, for which the data cannot be stored on one specific machine and must instead be kept in a distributed architecture. In a distributed architecture, the matrix multiplication cost is reduced drastically; hence, the cost of these algebraic operations should not be included when comparing the computational efficiency of the methods. It would be better to provide a computational complexity analysis of these algorithms in a distributed system, which is beyond the scope of this paper. Therefore the comparison based on the third column of TABLE I makes more sense, as performing an SVD or QR on a large data matrix is the bottleneck. Based on that comparison, all the randomized algorithms outperform the deterministic SS in terms of computational complexity.

In TABLE II, we show the time required to obtain the subset using the deterministic and all the randomized SS methods for six networks. The reported values are averaged over 20 independent trials. If a smaller projected dimension is chosen, the time required is expected to be much less than the values in TABLE II, which are for $l = n/2$.
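The sketch below (not the authors' benchmarking code) shows how such averaged timings could be gathered for any of the subset selection sketches above; the 20-trial average mirrors the protocol described for TABLE II, and the helper name is ours.

import time
import numpy as np

def average_time(method, A, k, l, trials=20):
    """Average wall-clock time of one subset selection method over repeated trials."""
    elapsed = []
    for _ in range(trials):
        start = time.perf_counter()
        method(A, k, l)
        elapsed.append(time.perf_counter() - start)
    return float(np.mean(elapsed))

# Example usage with the sketches defined earlier and an adjacency matrix A:
# n = A.shape[1]
# print(average_time(randomized_ss_1, A, n // 2, n // 2))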

Figure 3: The leftmost panel shows the Les Miserables network Kunegis (2013) with 77 nodes and 254 edges. The middle panel shows the same network with the subset network (38 nodes in orange, 141 edges) extracted using the deterministic SS process embedded in the main network. The third panel shows the subset network (38 nodes in orange, 126 edges) extracted using the randomized SS1 process embedded in the main network. Note that the base network nodes are scaled according to their eigenvector centralities in the second and third panels. The orange nodes being large in size implies that the subset captures the most central nodes in the network.
Figure 4: The leftmost panel shows the subset network embedded in the Les Miserables network Kunegis (2013) with 77 nodes and 254 edges; this subset network, extracted using randomized SS2, has 38 nodes (shown in orange) and 126 edges. The middle panel shows the same network with the subset network (38 nodes in orange, 126 edges) extracted using the randomized SS3 process embedded in the main network. The third panel shows the subset network (38 nodes in orange, 125 edges) extracted using the randomized SS4 process embedded in the main network. Note that the base network nodes are scaled according to their eigenvector centralities in the second and third panels. The orange nodes being large in size implies that the subset captures the most central nodes in the network.

IV Application of the SS procedure in complex networks

For a detailed account of the application of the SS procedure to complex networks, please refer to Tripathi and Reza (2019); for the sake of completeness, we summarise it here. The essential condition for obtaining a size-reduced representation of a complex network is to retain its eigenvalue spectrum and to quantify any loss in matrix energy in terms of the difference of the Frobenius norms of $A$ and of the subset. On a slightly different note, the constraint conditions, i.e. preservation of the full matrix energy in the subset and the overlap of the PSVs, need not be imposed on the SS procedure at all. In that case, one computes the subset for an arbitrary value of $k$, and the subset contains the $k$ most linearly independent columns. In Tripathi and Reza (2019), we showed that even when the subset size (number of columns $k$) was chosen to be half the size of $A$, the PSV of the subset has more than 99% overlap with the PSV of $A$, and the loss of matrix energy is very minimal.

The amount of overlap between the PSVs was quantified using the Cosine Similarity (CS), which is a measure of the relative orientation of two vectors (refer to Eq. 24, where $u$ and $v$ are the PSVs of the main adjacency matrix and of the subset). Bounded between [0, 1], the CS is maximum when the two vectors are oriented along the same direction and minimum when they are perpendicular to each other. In our case, the CS is a measure of the extent of information retrieval from the main matrix into the subset.

$\mathrm{CS}(u, v) = \dfrac{u \cdot v}{\|u\|\, \|v\|}.$    (24)

Obtaining the subset network is not entirely straightforward, as the selected subset is generally a non-square matrix. To this end, the subset selection procedure was extended, and the rows of the selected subset were reordered using the same permutation $P$ (refer to Eq. 25, where $A_s$ contains the top $k$ rows and $A_r$ the remaining ones). This is intuitively justified owing to the symmetric nature of the adjacency matrix. Hence the subset network has $k$ nodes, with connections defined by the $k \times k$ subset matrix $A_s$.

$P^T A_1 = \begin{bmatrix} A_s \\ A_r \end{bmatrix}.$    (25)
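The two post-processing steps above are straightforward to express in Python; the sketch below (illustrative only, with `cols` being the column indices returned by any of the subset selection sketches and the function names ours) computes the cosine similarity of Eq. 24 and the square subset network of Eq. 25.

import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two principal singular vectors (Eq. 24)."""
    # the absolute value guards against an arbitrary overall sign flip of a singular vector
    return float(abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def subset_network(A, cols):
    """Square subset adjacency matrix: the selected columns, then the same rows (Eq. 25)."""
    return A[np.ix_(cols, cols)]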

We find that the subsets extracted through all the randomized versions of the deterministic SS procedure have excellent CS of the Principal Singular Vector (PSV) with that of the original network adjacency matrix (see FIG. 2). This has two important implications. First, the conservation of the PSV in the subsets implies that one can infer the influence of nodes from the subset PSV itself. As the PSV entries are indicative of node influence (the components of the PSV being the eigenvector centralities of the nodes), one can infer a node's influence in spreading processes on the network using the corresponding component of the subset PSV. This is a useful result, as performing the SVD of a large network to find the PSV can be computationally expensive. Second, the subset network identifies the most important network structure in terms of its nodes and edges, such that the selected network has enhanced information-flow governing properties. We show the application of all the deterministic and randomized SS procedures on the Les Miserables network Kunegis (2013) in FIG. 3 and FIG. 4. One can see that all the deterministic and randomized algorithms efficiently detect almost all high eigenvector-centrality nodes, and hence we can conclude that the subset nodes contribute maximally to the inverse participation ratio of the networks; please refer to Tripathi and Reza (2019) for details. Also, we find that the loss in matrix norm of the subsets with $k = n/2$ is very minimal for all the network examples we took (refer to TABLE III).

Type Networks (V, E) of network q (V, E) of SS network loss1 loss2 loss3 loss4 loss5 CS
Weighted, real US Air (332, 2126) 166 (890,1635) 0.011 0.010 0.010 0.010 0.010 0.99
Les Miserables (77, 254) 38 (38, 141) 0.017 0.028 0.028 0.028 0.025 0.99
Train Bombing (64, 243) 32 (32, 123) 0.100 0.100 0.100 0.100 0.105 0.99
Unweighted, real Karate (34, 78) 20 (20, 47) 0.156 0.164 0.164 0.164 0.152 0.94
Cat Brain (65, 730) 32 (32, 247) 0.236 0.232 0.232 0.232 0.219 0.99
Drosophila (1781, 9016) 890 (890, 7026) 0.057 0.058 0.058 0.058 0.056 0.99
Power Grid (4941, 6594) 2470 (2470, 2863) 0.166 0.161 0.161 0.161 0.160 0.985
Jazz Musicians (198, 2742) 96 (96, 947) 0.232 0.214 0.214 0.214 0.220 0.98
Friendship (1858, 12534) 929 (929, 7618) 0.114 0.113 0.113 0.113 0.109 0.99
Unweighted, model Barabasi Albert (1000, 2991) 500 (500, 1418) 0.150 0.151 0.151 0.151 0.151 0.99
Erdos Renyi (1000, 7558) 500 (500, 2630) 0.230 0.234 0.234 0.234 0.233 0.98
Power Law (1000, 1360) 500 (500, 947) 0.082 0.087 0.087 0.087 0.082 0.88


Table 3: A table of SS results on model networks and on weighted and unweighted real network examples. The real networks were downloaded from KONECT Kunegis (2013) and the model networks were generated using the Python module NetworkX Hagberg et al. (2008). (V, E) represents the vertices and edges of the networks. The columns loss1, loss2, loss3, loss4 and loss5 represent the Frobenius-norm differences between the main networks and the corresponding subsets obtained through the deterministic SS, RSS1, RSS2, RSS3 and RSS4, respectively. CS represents the cosine similarity between the PSV of the main network adjacency matrix and those of the subsets; it is found to be greater than or equal to the value quoted in the column for all the subsets.

V Conclusion and discussion

The present manuscript presents a class of randomized subset selection procedures for complex network data. The main highlight of this work is the use of the Random Projection scheme in the process of extracting the top-$k$ most linearly independent columns from a data matrix. The RP method thrives on the rank deficiency of the input data matrix. Apart from reducing the time complexity incurred by performing the SVD of the full data matrix, as required in the deterministic SS procedure, RP preserves the maximum information from the data matrix. We have verified the applicability of the proposed methods on complex network datasets. Complex networks, for example the Internet, the World Wide Web and traffic networks, can be huge as well as dynamically evolving; owing to their reduced time complexity, the proposed methods can considerably unburden the computing devices on such datasets. Finding the spectra and eigenvectors of complex networks is of paramount importance for inferring their topological and functional properties. Also, finding the most critical nodes and links, or the most functional network structure, is currently one of the most researched topics in complex networks. We showed that using the SS procedure the important network structure, capturing the most influential network nodes, can be extracted; using the randomized versions of SS, this process can be sped up many fold. We have taken the projected dimension $l = n/2$ and showed all the results with this choice; however, $l$ can be reduced further depending on the sparsity of the data (small network density), further reducing the time complexity. The determination of an appropriate $l$ is altogether a different problem and serves as a prelude to our work. Although we have applied the randomized SS procedures to complex network data, these procedures can very well be extended to general classes of large datasets and to real-time analysis of time-evolving data.

References

  • Tripathi and Reza (2019) R. Tripathi and A. Reza, A subset selection based approach to finding important structure of complex networks, arXiv preprint arXiv:1903.04649 (2019).
  • Albert et al. (1999) R. Albert, H. Jeong, and A.-L. Barabási, Internet: Diameter of the world-wide web, nature 401, 130 (1999).
  • Dorogovtsev et al. (2003) S. N. Dorogovtsev, A. V. Goltsev, J. F. Mendes, and A. N. Samukhin, Spectra of complex networks, Physical Review E 68, 046109 (2003).
  • Rodgers et al. (2005) G. Rodgers, K. Austin, B. Kahng, and D. Kim, Eigenvalue spectra of complex networks, Journal of Physics A: Mathematical and General 38, 9431 (2005).
  • Estrada and Hatano (2008) E. Estrada and N. Hatano, Communicability in complex networks, Physical Review E 77, 036111 (2008).
  • Sole-Ribalta et al. (2013) A. Sole-Ribalta, M. De Domenico, N. E. Kouvaris, A. Diaz-Guilera, S. Gomez, and A. Arenas, Spectral properties of the laplacian of multiplex networks, Physical Review E 88, 032807 (2013).
  • Wang et al. (2008) G. Wang, Y. Shen, and M. Ouyang, A vector partitioning approach to detecting community structure in complex networks, Computers & Mathematics with Applications 55, 2746 (2008).
  • Goltsev et al. (2012) A. V. Goltsev, S. N. Dorogovtsev, J. G. Oliveira, and J. F. Mendes, Localization and spreading of diseases in complex networks, Physical review letters 109, 128702 (2012).
  • Chen et al. (2012) D. Chen, L. Lü, M.-S. Shang, Y.-C. Zhang, and T. Zhou, Identifying influential nodes in complex networks, Physica a: Statistical mechanics and its applications 391, 1777 (2012).
  • Kitsak et al. (2010) M. Kitsak, L. K. Gallos, S. Havlin, F. Liljeros, L. Muchnik, H. E. Stanley, and H. A. Makse, Identification of influential spreaders in complex networks, Nature physics 6, 888 (2010).
  • Newman (2006) M. E. Newman, Modularity and community structure in networks, Proceedings of the national academy of sciences 103, 8577 (2006).
  • White and Smyth (2005) S. White and P. Smyth, A spectral clustering approach to finding communities in graphs, in Proceedings of the 2005 SIAM international conference on data mining (SIAM, 2005) pp. 274–285.
  • Newman (2013) M. E. Newman, Spectral methods for community detection and graph partitioning, Physical Review E 88, 042822 (2013).
  • Golub and Van Loan (2012) G. H. Golub and C. F. Van Loan, Matrix computations, Vol. 3 (JHU Press, 2012).
  • Butler et al. (2005) J. M. Butler, D. T. Bishop, and J. H. Barrett, Strategies for selecting subsets of single-nucleotide polymorphisms to genotype in association studies, in BMC genetics, Vol. 6 (BioMed Central, 2005) p. S72.
  • Wilzeck and Kaiser (2008) A. Wilzeck and T. Kaiser, Antenna subset selection for cyclic prefix assisted mimo wireless communications over frequency selective channels, EURASIP Journal on Advances in Signal Processing 2008, 130 (2008).
  • Bingham and Mannila (2001) E. Bingham and H. Mannila, Random projection in dimensionality reduction: applications to image and text data, in Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (ACM, 2001) pp. 245–250.
  • Kanjilal and Banerjee (1995) P. P. Kanjilal and D. N. Banerjee, On the application of orthogonal transformation for the design and analysis of feedforward networks, IEEE Transactions on Neural Networks 6, 1061 (1995).
  • Halko et al. (2011) N. Halko, P.-G. Martinsson, and J. A. Tropp, Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, SIAM review 53, 217 (2011).
  • Halko et al. (2009) N. Halko, P.-G. Martinsson, and J. A. Tropp, Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions,  (2009).
  • Kunegis (2013) J. Kunegis, Konect: the koblenz network collection, in Proceedings of the 22nd International Conference on World Wide Web (ACM, 2013) pp. 1343–1350.
  • Hagberg et al. (2008) A. Hagberg, P. Swart, and D. S Chult, Exploring network structure, dynamics, and function using NetworkX, Tech. Rep. (Los Alamos National Lab.(LANL), Los Alamos, NM (United States), 2008).