Graph Signal Sampling via Reinforcement Learning



We formulate the problem of sampling and recovering clustered graph signals as a multi-armed bandit (MAB) problem. This formulation lends itself naturally to learning sampling strategies using the well-known gradient MAB algorithm. In particular, the sampling strategy is represented as a probability distribution over the individual arms of the MAB and optimized using gradient ascent. Illustrative numerical experiments indicate that sampling strategies based on the gradient MAB algorithm outperform existing sampling methods.


Keywords: machine learning, reinforcement learning, multi-armed bandit, graph signal processing, total variation, complex networks.

1 Introduction

Modern information processing systems generate massive datasets which are often strongly heterogeneous, e.g., partially labeled mixtures of different media (audio, video, text). A quite successful approach to such datasets is based on representing the data as networks or graphs. In particular, we represent datasets by graph signals defined over an underlying graph, which reflects similarities between individual data points. The graph signal values encode label information which often conforms to a clustering hypothesis, i.e., the signal values (labels) of close-by nodes (similar data points) are similar.

Two core problems considered within graph signal processing (GSP) are (i) how to sample graph signals, i.e., which signal values provide the most information about the entire dataset, and (ii) how to recover the entire graph signal from these few signal values (samples). These problems have been studied in [1, 2, 3, 4, 5, 6], which discussed convex optimization methods for recovering a graph signal from a small number of signal values observed on the nodes belonging to a given (small) sampling set. Sufficient conditions on the sampling set and clustering structure such that these convex methods are successful have been discussed in [4, 7].

Contribution. We propose a novel approach to graph signal sampling and recovery by interpreting it as a reinforcement learning (RL) problem. In particular, we interpret an online sampling algorithm as an artificial intelligence agent which chooses the nodes to be sampled on the fly. The behavior of the sampling agent is represented by a probability distribution ("policy") over a discrete set of actions which are at the disposal of the sampling agent for choosing the next node at which the graph signal is sampled. The ultimate goal is to learn a sampling policy which chooses signal samples that allow for a small reconstruction error.

Notation. The vector with all entries equal to zero is denoted $\mathbf{0}$. Given a vector $\mathbf{x}$ with non-negative entries, we denote by $\sqrt{\mathbf{x}}$ the vector whose entries are the square roots of the entries of $\mathbf{x}$. Similarly, we denote the element-wise square of the vector $\mathbf{x}$ as $\mathbf{x}^{2}$.

Outline. In Section 2 we formulate the problem of recovering a clustered graph signal from its values on few nodes forming a sampling set as a convex optimization problem. Our main contribution is in Section 3 where we introduce our RL-based sampling method. The results of some numerical experiments are presented in Section 4. We discuss our findings in Section 5 and finally conclude in Section 6.

2 Problem Formulation

We consider datasets which are represented by a data graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. The data graph is an undirected connected graph (no self-loops and no multi-edges) with nodes $\mathcal{V} = \{1, \ldots, N\}$, which are connected by edges $\mathcal{E}$. Each node $i \in \mathcal{V}$ represents an individual data point and an edge $\{i, j\} \in \mathcal{E}$ connects nodes representing similar data points. The distance $d(i, j)$ between two different nodes is defined as the length of the shortest path between those nodes. For a given node $i \in \mathcal{V}$, we define its neighbourhood as $\mathcal{N}(i) = \{ j \in \mathcal{V} : \{i, j\} \in \mathcal{E} \}$.

It will be handy to generalize the notion of neighbourhood and define, for some $k \geq 1$, the $k$-step neighbourhood of a node $i \in \mathcal{V}$ as $\mathcal{N}^{(k)}(i) = \{ j \in \mathcal{V} \setminus \{i\} : d(i, j) \leq k \}$. The 1-step neighbourhood coincides with the neighbourhood of a node, i.e., $\mathcal{N}^{(1)}(i) = \mathcal{N}(i)$.

In many applications we can associate each data point $i \in \mathcal{V}$ with a label $x_{i} \in \mathbb{R}$. These labels induce a graph signal $\mathbf{x} = (x_{1}, \ldots, x_{N})^{T}$ defined over the graph $\mathcal{G}$.

We aim at recovering a graph signal based on observing its values only for nodes belonging to a sampling set $\mathcal{M} \subseteq \mathcal{V}$.

Since acquiring signal values (i.e., labelling data points) is often expensive (requiring manual labor), the sampling set is typically much smaller than the overall dataset, i.e., $|\mathcal{M}| \ll N$. For a fixed sampling set size (sampling budget) $M = |\mathcal{M}|$ it is important to choose the sampling set such that the signal samples carry maximal information about the entire graph signal.

The recovery of the entire graph signal from (few) signal samples is possible for clustered graph signals which do not vary too much over well-connected subsets of nodes (clusters) (cf. [4, 8]). We will quantify how well a graph signal is aligned to the cluster structure using the total variation (TV)

$\|\mathbf{x}\|_{\mathrm{TV}} = \sum_{\{i,j\} \in \mathcal{E}} |x_{j} - x_{i}|.$
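As a concrete illustration, the TV of a piecewise-constant clustered signal is determined entirely by the cut edges between clusters. A minimal Python sketch (the edge list and signal values are hypothetical toy data, not from the experiments below):

```python
def total_variation(edges, x):
    """TV of a graph signal: sum of |x_j - x_i| over all edges {i, j}."""
    return sum(abs(x[i] - x[j]) for i, j in edges)

# Two dense clusters {0, 1, 2} and {3, 4, 5} joined by one cut edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
x = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]   # piecewise-constant clustered signal
tv = total_variation(edges, x)       # only the cut edge contributes
```

A signal that oscillates across the clusters would incur a much larger TV, which is what the recovery method below penalizes.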
Recovering a graph signal based on the signal values $\{x_{i}\}_{i \in \mathcal{M}}$ for the nodes of the sampling set $\mathcal{M}$ can be accomplished by solving

$\hat{\mathbf{x}} \in \arg\min_{\tilde{\mathbf{x}} \in \mathbb{R}^{N}} \|\tilde{\mathbf{x}}\|_{\mathrm{TV}} \quad \text{s.t.} \quad \tilde{x}_{i} = x_{i} \text{ for all } i \in \mathcal{M}. \qquad (1)$
This is a convex optimization problem with a non-differentiable objective function, which precludes the use of simple gradient descent methods. However, by applying the primal-dual method of Chambolle and Pock [9] to the recovery problem (1), an efficient sparse label propagation algorithm has been obtained in [8].
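The primal-dual algorithm of [8, 9] is the efficient way to solve (1); purely for illustration, the same constrained TV minimization can be approximated by a much simpler (and slower) projected-subgradient sketch. The toy graph and sample values here are hypothetical:

```python
import numpy as np

def recover_tv(edges, n, samples, n_iters=2000, step=0.1):
    """Approximate TV minimization with the sampled values held fixed.

    Projected subgradient method: after every step the sampled entries are
    clamped back to their observed values (`samples`: node index -> value).
    """
    x = np.zeros(n)
    for i, v in samples.items():
        x[i] = v
    for t in range(n_iters):
        g = np.zeros(n)
        for i, j in edges:                 # subgradient of sum |x_i - x_j|
            s = np.sign(x[i] - x[j])
            g[i] += s
            g[j] -= s
        x -= step / np.sqrt(1.0 + t) * g   # diminishing step size
        for i, v in samples.items():       # project back onto the constraints
            x[i] = v
    return x

# Two triangles joined by one cut edge; one sample per cluster.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
x_hat = recover_tv(edges, 6, {0: 1.0, 5: 0.0})
```

With one sample per cluster, the recovered signal approaches the piecewise-constant ground truth, illustrating why the choice of sampling set matters.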

A simple but useful model for clustered graph signals is

$x_{i} = \sum_{\mathcal{C} \in \mathcal{F}} a_{\mathcal{C}} \, \mathcal{I}_{\mathcal{C}}(i), \qquad (2)$

with the cluster indicator signals

$\mathcal{I}_{\mathcal{C}}(i) = 1 \text{ for } i \in \mathcal{C}, \quad \mathcal{I}_{\mathcal{C}}(i) = 0 \text{ otherwise}.$

The partition $\mathcal{F} = \{\mathcal{C}_{1}, \ldots, \mathcal{C}_{|\mathcal{F}|}\}$ underlying the signal model (2) can be chosen arbitrarily in principle. However, our methods are expected to be most useful if the partition matches the intrinsic cluster structure of the data graph $\mathcal{G}$. Clustered graph signals of the form (2) conform with the network topology, in the sense of having small TV $\|\mathbf{x}\|_{\mathrm{TV}}$, if the underlying partition consists of disjoint clusters with small cut-sizes. Relying on the clustered signal model (2), [7, Thm. 3] presents a sufficient condition on the choice of the sampling set such that the solution of (1) coincides with the true underlying clustered graph signal of the form (2). The condition in [7, Thm. 3] suggests choosing the nodes of the sampling set preferably near the boundaries between the different clusters.

3 Signal Sampling as Reinforcement Learning

The problem of selecting the sampling set $\mathcal{M}$ and recovering the entire graph signal from the signal values $\{x_{i}\}_{i \in \mathcal{M}}$ can be interpreted as an RL problem. Indeed, we consider the selection of the nodes to be sampled as being carried out by an "agent" which crawls over the data graph $\mathcal{G}$. The set of actions our sampling agent may take is a finite set $\mathcal{A} = \{1, \ldots, K\}$ of hop distances.

A specific action $a \in \mathcal{A}$ refers to the number of hops the sampling agent performs starting at the current node $i$ to reach a new node $j$, which is added to the sampling set, i.e., $\mathcal{M} \leftarrow \mathcal{M} \cup \{j\}$. In particular, the new node $j$ is selected uniformly at random among the nodes which belong to the $a$-step neighbourhood $\mathcal{N}^{(a)}(i)$ (see Figure 1).



Figure 1: The filled node represents the current location of the sampling agent at time $t$. We also indicate the 1-, 2- and 3-step neighbourhoods.

The problem of optimally selecting actions at a given time can be formulated as a MAB problem. Each arm of the bandit is associated with an action. In our setup, a sampling strategy (or policy) amounts to specifying a probability distribution $\pi(a)$ over the individual actions $a \in \mathcal{A}$. We parametrize this probability distribution with a weight vector $\mathbf{w} = (w_{1}, \ldots, w_{K})^{T}$ using the softmax rule:

$\pi(a) = \frac{\exp(w_{a})}{\sum_{a' \in \mathcal{A}} \exp(w_{a'})}.$
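A numerically stable softmax parametrization can be sketched as follows (the weight values are hypothetical):

```python
import numpy as np

def policy(w):
    """Softmax policy: pi(a) = exp(w_a) / sum over a' of exp(w_a')."""
    z = np.exp(w - np.max(w))   # subtract the max for numerical stability
    return z / z.sum()

w = np.array([0.0, 1.0, 2.0])   # hypothetical weights for three hop actions
pi = policy(w)                  # larger weight -> larger action probability
```

Subtracting the maximum weight before exponentiating leaves the distribution unchanged but avoids overflow for large weights.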

The weight vector $\mathbf{w}$ is tuned in an episodic manner, with each episode amounting to selecting a sampling set $\mathcal{M}$ based on the policy $\pi$. At each timestep the agent randomly draws an action $a$ according to the distribution $\pi$ and performs a transition to the next node $j$, which is selected uniformly at random from the $a$-step neighbourhood $\mathcal{N}^{(a)}(i)$ of the current node $i$. As mentioned earlier, the node $j$ is added to the sampling set, i.e., $\mathcal{M} \leftarrow \mathcal{M} \cup \{j\}$. We also record the action $a$ and add it to the action list. The process continues until we obtain a sampling set with the prescribed size (sampling budget) $M$.
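The episodic sampling walk can be sketched as follows; the BFS-based neighbourhood routine, the adjacency-list format, and the toy path graph are our own illustrative choices:

```python
import random
from collections import deque

def k_step_neighbourhood(adj, node, k):
    """All nodes within shortest-path distance <= k of `node` (node excluded)."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return [v for v in dist if v != node]

def sample_episode(adj, pi, budget, rng):
    """Collect a sampling set M and action list A by following policy pi.

    pi[a - 1] is the probability of hopping a steps (a = 1, ..., len(pi)).
    """
    node = rng.choice(sorted(adj))        # random starting node
    M, A = [node], []
    while len(M) < budget:
        a = rng.choices(range(1, len(pi) + 1), weights=pi)[0]
        node = rng.choice(k_step_neighbourhood(adj, node, a))
        if node not in M:                 # grow the sampling set
            M.append(node)
        A.append(a)                       # record the pulled arm
    return M, A

# Path graph 0-1-2-3-4 and a uniform policy over 1- and 2-hop actions.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
M, A = sample_episode(adj, pi=[0.5, 0.5], budget=3, rng=random.Random(0))
```

The recorded action list is what ties the episode reward back to the individual arms in the weight update below.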

Our goal is to learn an optimal policy for the sampling agent in order to obtain signal samples which allow recovery of the entire graph signal with minimum error. We assess the quality of the policy $\pi$ using the mean squared error (MSE) incurred by the signal $\hat{\mathbf{x}}$ recovered via (1) from the sampling set $\mathcal{M}$ obtained by following policy $\pi$, and define the episode reward as the negative MSE:

$r = -\frac{1}{N} \| \hat{\mathbf{x}} - \mathbf{x} \|_{2}^{2}.$

The obtained reward is associated with all actions/arms which contributed to picking samples into the sampling set during the episode. For example, if the sampling set has been obtained by pulling arms 1, 2 and 5, the obtained reward is associated with all of these arms, because we do not know the exact contribution of a specific arm to the finally obtained MSE.

The key idea behind the gradient MAB algorithm is to update the weights so that actions yielding higher rewards become more probable under $\pi$ [10, Chapter 2.8]. Following [10], the weight update can be accomplished using a gradient ascent algorithm:

$w_{a} \leftarrow w_{a} + \alpha \, r \, (1 - \pi(a))$ for each arm $a$ in the episode's action list, and $w_{a} \leftarrow w_{a} - \alpha \, r \, \pi(a)$ for the remaining arms, $\qquad (3)$

where $\alpha > 0$ denotes the learning rate.
The single difference between the update rule (3) and the one presented in [10, Eq. 2.10] is that in our case the weight update is performed at the end of each episode and not after every arm pull. That is because we do not know the reward immediately after pulling an arm and have to wait until the whole sampling set is collected and the reward is observed. The intuition behind the update equation (3) is that for each arm which has participated in picking a node into the sampling set, the weight is increased, whereas the weights of the remaining arms are decreased. In both cases the degree of weight increase/decrease is scaled by the reward obtained with the help of this arm as well as by the learning rate $\alpha$. For faster convergence, in our implementation we use mini-batch gradient ascent in combination with the RMSprop technique [11] instead of stochastic gradient ascent (see Algorithm 1 for implementation details).
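A sketch of the episodic weight update (3) in its plain stochastic form, without the mini-batch/RMSprop refinements mentioned above (and without the reward baseline of [10, Eq. 2.10]):

```python
import numpy as np

def episode_update(w, episode_arms, reward, alpha=0.1):
    """Episodic gradient-bandit update in the spirit of Eq. (3).

    Arms that contributed a sample during the episode (`episode_arms`) are
    pushed up in proportion to the episode reward; all other arms are pushed
    down.  The update happens once, after the episode ends.
    """
    z = np.exp(w - w.max())
    pi = z / z.sum()                          # current softmax policy
    grad = np.zeros_like(w)
    for a in range(len(w)):
        if a in episode_arms:                 # arm participated in the episode
            grad[a] = reward * (1.0 - pi[a])
        else:                                 # remaining arms
            grad[a] = -reward * pi[a]
    return w + alpha * grad                   # gradient-ascent step

w = episode_update(np.zeros(3), episode_arms={0, 2}, reward=1.0)
```

With a positive reward, the participating arms 0 and 2 gain weight while arm 1 loses weight; a mini-batch variant would accumulate `grad` over several episodes before applying it.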

The choice of the gradient MAB algorithm is additionally justified by the study [12], which shows that in environments with non-stationary rewards a probabilistic MAB policy can result in a higher expected reward than single-best-action policies. In our problem, the non-stationarity of the reward arises from the graph structure itself, i.e., the reward distribution for a particular arm of the bandit depends on the location of the sampling agent. Suppose the sampling budget is 2 and consider the example presented in Figure 4. In case (a) the sampling agent is initially located at node 4. By pulling arm #1 it can only pick node 3, which is in the other cluster. It is easy to verify that by using the recovery method (1) the graph signal will be perfectly reconstructed (MSE = 0). On the other hand, case (b) shows the situation where the agent can only pick nodes 2 or 3, belonging to the same cluster as the currently sampled node, leading to a non-zero reconstruction MSE.

Figure 4: Illustration of the reward being conditioned on the position of the sampling agent. Red node: current position of the sampling agent; blue region: nodes within distance 1 from the sampling agent. Node indices are shown inside the nodes, signal values outside.

The whole process of weight updates is repeated for a sufficient number of episodes until convergence is reached and the optimal stochastic policy is attained. The learning procedure described above is summarized in the form of pseudocode in Algorithm 1.

Input: data graph $\mathcal{G}$, sampling budget $M$, batch size $B$, learning rate $\alpha$
Initialize $\mathbf{w} \leftarrow \mathbf{0}$, episode counter $e \leftarrow 1$
repeat
      select starting node $i$ randomly; initialize $\mathcal{M} \leftarrow \{i\}$ and an empty action list
      for $t = 1, \ldots, M - 1$ do
            draw an action $a \sim \pi$; hop to a node $j$ chosen uniformly from $\mathcal{N}^{(a)}(i)$
            $\mathcal{M} \leftarrow \mathcal{M} \cup \{j\}$; append $a$ to the action list; $i \leftarrow j$
      end for
      recover $\hat{\mathbf{x}}$ via (1) and compute the episode reward $r$
      for each arm $a \in \mathcal{A}$ do
            accumulate the gradient contribution of (3) for arm $a$
      end for
      if $e$ mod $B = 0$ then
            update $\mathbf{w}$ with the accumulated mini-batch gradient using RMSprop [11]
      end if
      $e \leftarrow e + 1$
until convergence is reached
Algorithm 1 Online Sampling and Reconstruction

The obtained probability distribution represents the sampling strategy which incurs the minimum reconstruction MSE when using the convex recovery method (1).

4 Numerical Results

We now verify the effectiveness of the proposed sampling set selection algorithm using synthetic data and compare it to two existing approaches: random walk sampling (RWS) [13] and uniform random sampling (URS) [14, Section 2.3]. We define a random graph with 10 clusters whose sizes are drawn from the geometric distribution with probability of success . In accordance with the stochastic block model (SBM) [15], the intra- and inter-cluster connection probabilities are parametrized as and . We then generate a clustered graph signal according to (2) with the signal coefficients for . An example of a typical instance of a random graph with these parameters is shown in Figure 5.

Figure 5: Data graph obtained from the stochastic block model with and .
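Generating an SBM graph of the kind used in these experiments can be sketched as follows; the cluster sizes and connection probabilities below are hypothetical placeholders, not the paper's parameters:

```python
import numpy as np

def sbm(sizes, p_in, p_out, rng):
    """Sample an undirected SBM adjacency matrix (no self-loops).

    sizes: cluster sizes; p_in / p_out: intra- / inter-cluster edge probs.
    """
    labels = np.repeat(np.arange(len(sizes)), sizes)
    n = labels.size
    probs = np.where(labels[:, None] == labels[None, :], p_in, p_out)
    upper = np.triu(rng.random((n, n)) < probs, 1)   # upper triangle only
    return (upper | upper.T).astype(int), labels     # symmetrize

# Hypothetical parameters: three clusters, dense inside, sparse across.
rng = np.random.default_rng(0)
A, labels = sbm([30, 20, 10], p_in=0.5, p_out=0.01, rng=rng)
```

Sampling only the upper triangle and then symmetrizing guarantees an undirected graph without self-loops.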

Given the model, we generate training data consisting of random graphs, and for each graph instance we run Algorithm 1 for 10000 episodes, which is sufficient to reach convergence. It is interesting to note that the algorithm outperforms the RWS and URS strategies after 200 and 800 episodes, respectively (see Figure 6). The convergence speed is high at the initial stage and then decreases substantially after approximately 1000 episodes.

Figure 6: Convergence of gradient MAB for one learning instance (showing first 3700 episodes).

In Figure 7 we illustrate the mean policy

$\bar{\pi}(a) = \frac{1}{|\mathcal{T}|} \sum_{g \in \mathcal{T}} \pi_{g}(a), \qquad (4)$

obtained by averaging the policies $\pi_{g}$ learned on the individual training graphs $g \in \mathcal{T}$.
Figure 7: Mean policy for the stochastic block model family .

The obtained mean policy (4) is then evaluated by applying it to 500 new i.i.d. realizations of the data graph, yielding sampling sets $\mathcal{M}^{(1)}, \ldots, \mathcal{M}^{(500)}$, and measuring the normalized mean squared error (NMSE) incurred by graph signal recovery from those sampling sets:

$\mathrm{NMSE} = \frac{1}{500} \sum_{t=1}^{500} \frac{\| \hat{\mathbf{x}}^{(t)} - \mathbf{x}^{(t)} \|_{2}^{2}}{\| \mathbf{x}^{(t)} \|_{2}^{2}}.$
We perform similar measurements of the NMSE for the random walk and uniform random sampling algorithms under different sampling budgets and convert the results to a logarithmic scale.

Figure 8 shows that for a relative sampling budget of 0.2 the improvement in NMSE amounts to 5 dB in comparison to random sampling and 10 dB in comparison to the random walk approach. This gap widens further for a sampling budget of 0.4, to 8 dB and 20 dB respectively. The general tendency suggests a further increase of the gap for larger sampling budgets.

Figure 8: Test set error obtained from graph signal recovery based on different sampling strategies.

5 Discussion

We now interpret the results and explain the poor performance of RWS using a simple argument based on the properties of Markov chains. For simplicity we consider a graph with two clusters $\mathcal{C}_{1}$ and $\mathcal{C}_{2}$ having sizes $N_{1}$ and $N_{2}$. The probability of having an edge between nodes in the same cluster is denoted $p$, while the probability of having an edge between nodes in different clusters is $q$. An elementary calculation yields the probability of a random walk transitioning from $\mathcal{C}_{1}$ to $\mathcal{C}_{2}$ as

$p_{12} = \frac{q N_{2}}{p (N_{1} - 1) + q N_{2}}.$

Likewise, the probability of staying in $\mathcal{C}_{1}$ is

$p_{11} = 1 - p_{12} = \frac{p (N_{1} - 1)}{p (N_{1} - 1) + q N_{2}}.$

We note that $q N_{2}$ is the expected number of edges between a particular node of $\mathcal{C}_{1}$ and $\mathcal{C}_{2}$, and $p (N_{1} - 1)$ is the expected number of edges between a particular node of $\mathcal{C}_{1}$ and the remaining nodes of $\mathcal{C}_{1}$. Similarly, for $\mathcal{C}_{2}$:

$p_{21} = \frac{q N_{1}}{p (N_{2} - 1) + q N_{1}}, \qquad p_{22} = 1 - p_{21}.$

The transition matrix of the Markov chain which summarizes the probabilistic transitions between the clusters can be formalized as follows:

$\mathbf{P} = \begin{pmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{pmatrix}.$

Let $\boldsymbol{\rho} = (\rho_{1}, \rho_{2})$ be an equilibrium distribution [16] of the Markov chain, which reflects the amount of discrete time spent in $\mathcal{C}_{1}$ and $\mathcal{C}_{2}$. According to the theory of Markov chains [16], finding this distribution amounts to finding a vector $\boldsymbol{\rho}$ such that

$\boldsymbol{\rho} \mathbf{P} = \boldsymbol{\rho}, \qquad \rho_{1} + \rho_{2} = 1. \qquad (5)$

It is easy to verify that solving the system (5) yields the following equilibrium distribution:

$\rho_{1} = \frac{p_{21}}{p_{12} + p_{21}}, \qquad \rho_{2} = \frac{p_{12}}{p_{12} + p_{21}}.$

We now consider a particular numerical example of such a random graph. The computations yield the equilibrium distribution $\rho_{1} = 0.95$, $\rho_{2} = 0.05$, which means that 95% of the discrete time of a random walk is spent in $\mathcal{C}_{1}$ whereas only 5% of the time is spent in $\mathcal{C}_{2}$. This implies that upon termination of a random walk instance, its endpoint will be located in clusters $\mathcal{C}_{1}$ and $\mathcal{C}_{2}$ with probabilities $0.95$ and $0.05$, respectively.

From this example we can conclude that although $\mathcal{C}_{1}$ is only four times larger than $\mathcal{C}_{2}$, the probability of a random walk terminating within it is larger by a factor of 19. Thus, the random walk sampling algorithm tends to oversample larger clusters and undersample smaller ones. This partially explains the poor performance of random walk sampling in comparison to uniform random sampling, which samples clusters proportionally to their sizes.
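The equilibrium argument can be checked numerically; the two-cluster configuration below is a hypothetical example (not necessarily the exact numbers used above):

```python
import numpy as np

# Hypothetical two-cluster configuration: |C1| is four times |C2|.
N1, N2 = 80, 20          # cluster sizes
p, q = 0.5, 0.01         # intra- / inter-cluster edge probabilities

# Transition probabilities from the expected edge counts of each cluster.
p12 = q * N2 / (p * (N1 - 1) + q * N2)   # leave C1 for C2
p21 = q * N1 / (p * (N2 - 1) + q * N1)   # leave C2 for C1
P = np.array([[1 - p12, p12],
              [p21, 1 - p21]])

# Closed-form equilibrium distribution of the two-state chain (Eq. (5)).
rho = np.array([p21, p12]) / (p12 + p21)
```

For these parameters the walk spends the overwhelming majority of its time in the larger cluster, which is the oversampling effect described above.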

6 Conclusions

This paper proposes a novel approach to graph signal processing which is based on interpreting graph signal sampling and recovery as a reinforcement learning problem. The lens of reinforcement learning lends itself naturally to an online sampling strategy based on determining an optimal policy which minimizes the MSE of graph signal recovery. The proposed approach has been tested on a synthetic dataset generated in accordance with the stochastic block model. The obtained experimental results confirm the effectiveness of the proposed sampling algorithm in stochastic settings and demonstrate its advantages over existing approaches.


  1. G. B. Eslamlou, A. Jung, N. Goertz, and M. Fereydoon, “Graph signal recovery from incomplete and noisy information using approximate message passing,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp. 6170–6174.
  2. A. Jung, P. Berger, G. Hannak, and G. Matz, “Scalable graph signal recovery for big data over networks,” in Signal Processing Advances in Wireless Communications (SPAWC), 2016 IEEE 17th International Workshop on.   IEEE, 2016, pp. 1–6.
  3. A. Jung, A. Heimowitz, and Y. C. Eldar, “The network nullspace property for compressed sensing over networks,” in Sampling Theory and Applications (SampTA), 2017 International Conference on.   IEEE, 2017, pp. 644–648.
  4. A. Jung, N. Tran, and A. Mara, “When is network lasso accurate?” Front. Appl. Math. Stat. — doi: 10.3389/fams.2018.00009, vol. 3, p. 28, 2018.
  5. J. Sharpnack, A. Singh, and A. Rinaldo, “Sparsistency of the edge lasso over graphs,” in Artificial Intelligence and Statistics, 2012, pp. 1028–1036.
  6. A. Mara and A. Jung, “Recovery conditions and sampling strategies for network lasso,” in 2017 51st Asilomar Conference on Signals, Systems, and Computers, Oct 2017, pp. 405–409.
  7. A. Jung and M. Hulsebos, “The network nullspace property for compressed sensing of big data over networks,” Front. Appl. Math. Stat. — doi: 10.3389/fams.2018.00009, 2018.
  8. A. Jung, A. O. Hero III, A. Mara, and S. Aridhi, “Scalable semi-supervised learning over networks using nonsmooth convex optimization,” arXiv preprint arXiv:1611.00714, 2016.
  9. A. Chambolle and T. Pock, “A first-order primal-dual algorithm for convex problems with applications to imaging,” Journal of mathematical imaging and vision, vol. 40, no. 1, pp. 120–145, 2011.
  10. R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction, second edition (complete draft).   MIT press Cambridge, 2017, vol. 1, no. 1.
  11. T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural networks for machine learning, vol. 4, no. 2, pp. 26–31, 2012.
  12. O. Besbes, Y. Gur, and A. Zeevi, “Stochastic multi-armed-bandit problem with non-stationary rewards,” in Advances in neural information processing systems, 2014, pp. 199–207.
  13. S. Basirian and A. Jung, “Random walk sampling for big data over networks,” 2017 International Conference on Sampling Theory and Applications (SampTA), pp. 427–431, 2017.
  14. G. Puy, N. Tremblay, R. Gribonval, and P. Vandergheynst, “Random sampling of bandlimited signals on graphs,” Applied and Computational Harmonic Analysis, 2016.
  15. E. Mossel, J. Neeman, and A. Sly, “Stochastic block models and reconstruction,” arXiv preprint arXiv:1202.1499, 2012.
  16. J. R. Norris, Markov chains.   Cambridge university press, 1998, no. 2.