Non-asymptotic analysis of an optimal algorithm for network-constrained averaging with noisy links

# Non-asymptotic analysis of an optimal algorithm for network-constrained averaging with noisy links

## Abstract

The problem of network-constrained averaging is to compute the average of a set of values distributed throughout a graph using an algorithm that can pass messages only along graph edges. We study this problem in the noisy setting, in which the communication along each link is modeled by an additive white Gaussian noise channel. We propose a two-phase decentralized algorithm, and we use stochastic approximation methods in conjunction with the spectral graph theory to provide concrete (non-asymptotic) bounds on the mean-squared error. Having found such bounds, we analyze how the number of iterations required to achieve mean-squared error scales as a function of the graph topology and the number of nodes . Previous work provided guarantees with the number of iterations scaling inversely with the second smallest eigenvalue of the Laplacian. This paper gives an algorithm that reduces this graph dependence to the graph diameter, which is the best scaling possible.

## I Introduction

The problem of network-constrained averaging is to compute the average of a set of numbers distributed throughout a network, using an algorithm that is allowed to pass messages only along edges of the graph. Motivating applications include sensor networks, in which individual motes have limited memory and communication ability, and massive databases and server farms, in which memory constraints preclude storing all data at a central location. In typical applications, the average might represent a statistical estimate of some physical quantity (e.g., temperature, pressure etc.), or an intermediate quantity in a more complex algorithm (e.g., for distributed optimization). There is now an extensive literature on network-averaging, consensus problems, as well as distributed optimization and estimation (e.g., see the papers [7, 12, 10, 30, 20, 3, 4, 8, 23, 22]). The bulk of the earlier work has focused on the noiseless variant, in which communication between nodes in the graph is assumed to be noiseless. A more recent line of work has studied versions of the problem with noisy communication links (e.g., see the papers [18, 15, 27, 2, 29, 19, 24] and references therein).

The focus of this paper is a noisy version of network-constrained averaging in which inter-node communication is modeled by an additive white Gaussian noise (AWGN) channel. Given this randomness, any algorithm is necessarily stochastic, and the corresponding sequence of random variables can be analyzed in various ways. The simplest question to ask is whether the algorithm is consistent—that is, does it compute an approximate average or achieve consensus in an asymptotic sense for a given fixed graph? A more refined analysis seeks to provide information about this convergence rate. In this paper, we do so by posing the following question: for a given algorithm, how does number of iterations required to compute the average to within -accuracy scale as a function of the graph topology and number of nodes ? For obvious reasons, we refer to this as the network scaling of an algorithm, and we are interested in finding an algorithm that has near-optimal scaling law.

The issue of network scaling has been studied by a number of authors in the noiseless setting, in which the communication between nodes is perfect. Of particular relevance here is the work of Benezit et al. [5], who in the case of perfect communication, provided a scheme that has essentially optimal message scaling law for random geometric graphs. A portion of the method proposed in this paper is inspired by their scheme, albeit with suitable extensions to multiple paths that are essential in the noisy setting. The issue of network scaling has also been studied in the noisy setting; in particular, past work by Rajagopal and Wainwright [27] analyzed a damped version of the usual consensus updates, and provided scalings of the iteration number as a function of the graph topology and size. However, our new algorithm has much better scaling than the method [27].

The main contributions of this paper are the development of a novel two-phase algorithm for network-constrained averaging with noise, and establishing the near-optimality of its network scaling. At a high level, the outer phase of our algorithm produces a sequence of iterates based on a recursive linear update with decaying step size, as in stochastic approximation methods. The system matrix in this update is a time-varying and random quantity, whose structure is determined by the updates within the inner phase. These inner rounds are based on establishing multiple paths between pairs of nodes, and averaging along them simultaneously. By combining a careful analysis of the spectral properties of this random matrix with stochastic approximation theory, we prove that this two-phase algorithm computes a -accurate version of the average using a number of iterations that grows with the graph diameter (up to logarithmic factors).1 As we discuss in more detail following the statement of our main result, this result is optimal up to logarithmic factors, meaning that no algorithm can be substantially better in terms of network scaling.

The remainder of this paper is organized as follows. We begin in Section II with background and formulation of the problem. In Section III, we describe our algorithm, and state various theoretical guarantees on its performance. We then provide the proof of our main result in Section IV. Section V is devoted to some simulation results that confirm the sharpness of our theoretical predictions. We conclude the paper in Section VI.

Notation: For the reader’s convenience, we collect here some notation used throughout the paper. The notation means that there exists some constant and such for all , whereas means that for all . The notation means that and . Given a symmetric matrix , we denote its ordered sequence of eigenvalues by and also its -operator norm by . Finally we use to denote the Euclidean inner product.

## Ii Background and problem set-up

We begin in this section by introducing necessary background and setting up the problem more precisely.

### Ii-a Network-constrained averaging

Consider a collection of numbers. In statistical settings, these numbers would be modeled as identically distributed (i.i.d.) draws from an unknown distribution with mean . In a centralized setting, a standard estimator for the mean is the sample average . When all of the data can be aggregated at a central location, then computation of is straightforward. In this paper, we consider the network-constrained version of this estimation problem, modeled by an undirected graph that consists of a vertex set , and a collection of edges joining pairs of vertices. For , we view each measurement as associated with vertex . (For instance, in the context of sensor networks, each vertex would contain a mote and collect observations of the environment.) The edge structure of the graph enforces communication constraints on the processing: in particular, the presence of edge indicates that it is possible for sensors and to exchange information via a noisy communication channel. Conversely, sensor pairs that are not joined by an edge are not permitted to communicate directly.2 Every node has a synchronized internal clock, and acts at discrete times . For any given pair of sensors , we assume that the message sent from to is perturbed by an independent identically distributed variate. Although this additive white Gaussian noise (AWGN) model is more realistic than a noiseless model, it is conceivable (as pointed out by one of the reviewers) that other stochastic channel models might be more suitable for certain types of sensor networks, and we leave this exploration for future research.

Given this set-up, of interest to us are stochastic algorithms that generate sequences of iterates contained within , and we require that the algorithm be graph-respecting, meaning that in each iteration, it is allowed to send at most one message for each direction of every edge . At time , we measure the distance between and the desired average via the average (per node) mean-squared error, given by

 MSE(θ(t)) :=1nn∑i=1E[(θi(t)−\makebox[0.0pt][l]θ)2]. (1)

In this paper, our goal is for every node to compute the average up to an error tolerance . In addition, we require almost sure consensus among nodes, meaning

 P[θi(t)=θj(t)∀i,j=1,2,⋯,n]→1as t→∞.

Our primary goal is in characterizing the rate of convergence as a function of the graph topology and the number of nodes, to which we refer as the network-scaling function of the algorithm. More precisely, in order to study this network scaling, we consider sequences of graphs indexed by the number of nodes . For any given algorithm (defined for each graph ) and a fixed tolerance parameter , our goal is to determine bounds on the quantity

 TG(n;δ) :=inf{t=1,2,…∣MSE(θ(t))≤δ}. (2)

Note that is a stopping time, given by the smallest number of iterations required to obtain mean-squared error less than on a graph of type with nodes.

### Ii-B Graph topologies

Of course, the question that we have posed will depend on the graph type, and this paper analyzes three types of graphs, as shown in Figure 1. The first two graphs have regular topologies: the single cycle graph in panel (a) is degree two-regular, and the two-dimensional grid graph in panel (b) is degree four-regular. In addition, we also analyze an important class of random graphs with irregular topology, namely the class of random geometric graphs. As illustrated in Figure 1(c), a random geometric graph (RGG) in the plane is formed according by placing nodes uniformly at random in the unit square , and the connecting two nodes if their Euclidean distance is less than some radius . It is known that an RGG will be connected with high probability as long as ; see Penrose [26] for discussion of this and other properties of random geometric graphs.

A key graph-theoretic parameter relevant to our analysis is the graph diameter, denoted by . The path distance between any pair of nodes is the length of the shortest path joining them in the graph, and by definition, the graph diameter is the maximum path distance taken over all node pairs in the graph. It is straightforward to see that for the single cycle graph, and that for the two-dimensional grid. For a random geometric graph with radius chosen to ensure connectivity, it is known that .

Finally, in order to simplify the routing problem explained later, we divide the unit square into subregions (squares) of side length in case of grid, and for some constant , of side length in case of RGG. We assume that each node knows its location and is aware of the center of these subregions namely , where for the regular grid, and for the RGG. As a convention, we assume that is the left bottom square, to which we refer to as the first square. By construction, in a regular grid, each square will contain one and only one node which is located at the center of the square. From known properties of RGGs [26, 17], each of the given subregions will contain at least one node with high probability (w.h.p.). Moreover, an RGG is regular w.h.p, meaning that each square contains nodes (see Lemma 1 in the paper [12]). Accordingly, in the remainder of the paper, we assume without loss of generality that any given RGG is regular. Note that by construction, the transmission radius is selected so that each node in each square is connected to every other node in four adjacent squares.

## Iii Algorithm and its properties

In this section we state our main result which is followed by a detailed description of the proposed algorithm.

### Iii-a Theoretical guarantees

Our main result guarantees the existence of a graph-respecting algorithm with desirable properties. Recall the definition of the graph respecting scheme, as well as the definition of our AWGN channel model given in Section II. In the following statement, the quantity denotes a universal constant, independent of , , and .

###### Theorem 1.

For the communication model in which each link is an AWGN channel with variance , there is a graph-respecting algorithm such that:

1. Nodes almost surely reach a consensus. More precisely, we have

 θ(t)\lx@stackrela.s.⟶˜θ→1as t→∞, (3)

for some .

2. After iterations, the algorithm satisfy the following bounds on the :

1. For fixed tolerance sufficiently small, we have after

 Tcyc(n;δ)≤c0nmax{1δlog1δ,MSE(θ(0))σ2δ2}

iterations for a single cycle graph.

2. For fixed tolerance sufficiently small, we have after

 Tgrid(n;δ)≤c0√nmax{1δlog1δ,MSE(θ(0))σ2δ2}

iterations for the regular grid in two dimensions.

3. Assume that , for some fixed sufficiently small. Then we have after

 TRGG(n;δ)≤c0√n(logn)3max{1˜δlog(logn)2˜δ,MSE(θ(0))σ2˜δ2}

iterations for a regular random geometric graph.

Here is some constant independent of , , and , whose value may change from line to line.

Remarks: A few comments are in order regarding the interpretation of this result. First, it is worth mentioning that the quality of the different links does not have to be the same. Similar arguments apply to the case where noises have different variances. Second, although nodes almost surely reach a consensus, as guaranteed in part (a), this consensus value is not necessarily the same as the sample mean . The choice of is intentional to emphasize this point. However, as guaranteed by part (b), this consensus value is within distance of the actual sample mean. Since the sample mean itself represents a noisy estimate of some underlying population quantity, there is little point to computing it to arbitrary accuracy. Third, it is worthwhile comparing part (b) with previous results on network scaling in the noisy setting. Rajagopal and Wainwright [27] analyzed a simple set of damped updates, and showed that for the single cycle, and that for the two-dimensional grid. By comparison, the algorithm proposed here and our analysis thereof has removed factors of and from this scaling.

### Iii-B Optimality of the results

As we now discuss, the scalings in Theorem 1 are optimal for the cases of cycle and grid and near-optimal (up to logarithmic factor) for the case of RGG. In an adversarial setting, any algorithm needs at least iterations, where denotes the graph diameter, in order to approximate the average; otherwise, some node will fail to have any information from some subset of other nodes (and their values can be set in a worst-case manner). Theorem 1 provides upper bounds on the number of iterations that, at most, are within logarithmic factors of the diameter, and hence are also within logarithmic factors of the optimal latency scaling law. For the graphs given here, the scalings are also optimal in a non-adversarial setting, in which are modeled as chosen i.i.d. from some distribution. Indeed, for a given node , and positive integer , we let denote the depth neighborhood of , meaning the set of nodes that are connected to by a path of length at most . We then define the graph spreading function . Note that the function is non-decreasing, so that we may define its inverse function . As some examples:

• for a cycle on nodes, we have , and hence .

• for a -grid in two dimensions, we have the upper bound , and hence the lower bound .

• for a random geometric graph (RGG), we have the upper bound , which implies the lower bound

After steps, a given node can gather the information of at most nodes. For the average based on nodes to be comparable to , we require that , and hence the iteration number should be at least . For the three graphs considered here, this leads to the same conclusion, namely that iterations are required. We note also that using information-theoretic techniques, Ayaso et al. [1] proved a lower bound on the number of iterations for a general graph in terms of the Cheeger constant [9]. For the graphs considered here, the Cheeger constant is of the order of the diameter.

### Iii-C Description of algorithm

We now describe the algorithm that achieves the bounds stated in Theorem 1. At the highest level, the algorithm can be divided into two types of phases: an inner phase, and an outer phase. The outer phase produces a sequence of iterates , where is the outer time scale parameter. By design of the algorithm, each update of the outer parameters requires a total of message-passing rounds (these rounds corresponding to the inner phase), where in each round the algorithm can pass at most two messages per edge (one for each direction). To put everything in a nutshell, the algorithm is based on establishing multiple routes, averaging along them in an inner phase and updating the estimates based on the noisy version of averages along routes in an outer phase. Consequently, if we use the estimate , then in the language of Theorem 1, it corresponds to rounds of message-passing. Our goal is to establish upper bounds on that guarantee the MSE is . Figure 2 illustrates the basic operations of the algorithm.

#### Outer phase

In the outer phase, we produce a sequence of iterates according to the recursive update

 θ(τ+1)=θ(τ)−ϵ(τ){L(τ)θ(τ)+v(τ)}. (4)

Here is a sequence of positive decreasing stepsizes. For a given precision, , we set . For each , the quantity is a random matrix, whose structure is determined by the inner phase, and is an additive Gaussian term, whose structure is also determined in the inner phase. As will become clear in the sequel, even though and are dependent, they are both independent of . Moreover, given , the random vector is Gaussian with bounded variance.

#### Inner phase

The inner phase is the core of the algorithm and it involves a number of steps, as we describe here. We use to index the iterations within any inner phase, and use to denote the sequence of inner iterates within . For the inner phase corresponding to outer update from , the inner phase takes the initialization , and then reduces as output to the outer iteration. In more detail, the inner phase can be broken down into three steps, which we now describe in detail.

Step 1, deciding the averaging direction The first step is to choose a direction in which to perform averaging. In a single cycle graph, since left and right are viewed as the same, there is only one choice, and hence nothing to be decided. In contrast, the grid or RGG graphs require a decision-making phase, which proceeds as follows. One node in the first (bottom left) square, wakes up and chooses uniformly at random to send in the horizontal or vertical direction. We code this decision using the random variable , where (respectively ) represents the horizontal (respectively vertical) direction. To simplify matters, we assume in the remainder of this description that the averaging direction is horizontal, with the modifications required for vertical averaging being standard.

Step 2, choosing the head nodes This step applies only to the grid and RGG graphs. Given our assumption that the node in the first square has chosen the horizontal direction, it then passes a token message to a randomly selected node in the above adjacent square. The purpose of this token is to determine which node (referred to as the head node) should be involved in establishing the route passing through the given square. After receiving the token, the receiving node passes it to another randomly selected node in the above adjacent square and so on. Note that in the special case of grid, there is only one node in each square, and so no choices are required within squares. After rounds, one node in each square () receives the token, as illustrated in Figure 3. Note that again in a single cycle graph, there is nothing to be decided, since the direction and head nodes are all determined.

Step 3, establishing routes and averaging In this phase, each of head nodes establishes a horizontal path, and then perform averaging along the path, as illustrated in Figure 3(b). This part of algorithm involves three substeps, which we now describe in detail.

• For , each head node selects a node uniformly at random (u.a.r.) from within the right adjacent square, and passes to it the quantity . Given the Gaussian noise model, node then receives the quantity

 ˜γ1j(1) =γ1j(1)+v1j,where v1j∼N(0,σ2),

and then updates its own local variable as . We then iterate this same procedure—that is, node selects another u.a.r. from its right adjacent square, and passes the message . Overall, at round of this update procedure, we have

 γ(i+1)j(i+1) =γ(i+1)j(i)+˜γij(i),

where , and . At the end of round , node can compute a noisy version of the average along the path , in particular via the rescaled quantity

 ηj:=γmj(m)m=1mm∑i=1θsij(t)+vjj=1,2,⋯,m.

Here the variable , since the noise variables associated with different edges are independent.

• At this point, for each , each node which has the noisy version, , of the path average along route ; can share this information with other nodes in the path by sending back to the head node. A naive way to do this is as follows: node makes copies of —namely, , —and starts transmitting one copy at a time back to the head node. Nodes along the path simply forward what they receive, so that after time steps, node receives noisy copies of the average, where . Averaging the copies, node can compute the quantity

 γij(3m−i−1) :=1mm∑l=1˜η(l)ij=1mm∑l=1θslj(τ)+wij,

where . Since the noise on different links and different time steps are independent Gaussian random variables, we have , with

 σ2i =1mσ2+(1−im)σ2=(1−(i−1)m)σ2≤σ2.

Therefore, at the end of rounds, for each , all nodes have the average of the estimates in the path that is perturbed by Gaussian noise with variance at most . Since , we have .

• At the end of the inner phase , nodes that were involved in a path use their estimate of the average along the path to update , while estimate of the nodes that were not involved in any route remain the same. A given node on a path updates its estimate via

 θsij(τ+1)={1−ϵ′(τ)}θsij(τ)+ϵ′(τ)γij(M), (5)

where . On the other hand, using to denote the Euclidean inner product, we have , where is the averaging vector of the route with the entries for , and zero otherwise. Combining the scalar updates (5) yields the matrix-form update

 θ(τ+1) =θ(τ)−ϵ′(τ){(I−W(τ))θ(τ)+v′(τ)}, (6)

where the matrix is a random averaging matrix induced by the choice of routes and the random directions . The noise vector is additive noise. Note that for any given time, the noise at different nodes are correlated via the matrix , but for different time instants , the noise vectors and are independent. Moreover, from our earlier arguments, we have the upper bound .

## Iv Proof of Theorem 1

We now turn to the proof of Theorem 1. At a high-level, the structure of the argument consists of decomposing the vector into a sum of two terms: a component within the consensus subspace (meaning all values of the vector are identical), and a component in the orthogonal complement. Using this decomposition, the mean-squared error splits into a sum of two terms and we use standard techniques to bound them. As will be shown, these bounds depend on the parameter , noise variance, the initial MSE, and finally the (inverse) spectral gap of the update matrix. The final step is to lower bound the spectral gap of our update matrix.

### Iv-a Setting up the proof

Recalling the averaging matrix from the update (6), we define the Laplacian matrix . We then define the average matrix , where the expectation is taken place over the randomness due to the choice of routes;3 in a similar way, we define the associated (average) Laplacian . Finally, we define the rescaled quantities

 ϵ(τ):=λ2(\makebox[0.0pt][l]S)ϵ′(τ),L(τ):=1λ2(\makebox[0.0pt][l]S)S(τ),andv(τ):=1λ2(\makebox[0.0pt][l]S)v′(τ), (7)

where we recall that denotes the second smallest eigenvalue of a symmetric matrix. In terms of these rescaled quantities, our algorithm has the form

 θ(τ+1)=θ(τ)−ϵ(τ)[L(τ)θ(τ)+v(τ)], (8)

as stated previously in the update equation (4). Moreover, by construction, we have where . We also, for theoretical convenience, set

 ϵ′(τ)=1λ2(¯S)(τ+1δ), (9)

or equivalently for .

We first claim that the matrix is symmetric and (doubly) stochastic. The symmetry follows from the fact that different routes do not collide, whereas the matrix is stochastic because every row of (depending on whether the node corresponding to that row participates in a route or not) either represents an averaging along a route or is the corresponding row of the identity matrix. Consequently, we can interpret as the transition matrix of a reversible Markov chain. It is an irreducible Markov chain, because within any updating round, there is a positive chance of averaging nodes that are in the same column or row, which implies that the associated Markov chain can transition from one state to any other in at most two steps. Moreover, the stationary distribution of the chain is uniform (i.e., ).

We now use these properties to simplify our study of the sequence generated by the update equation (8). Since is real and symmetric, it has the eigenvalue decomposition , where is a unitary matrix (that is, ). Moreover, we have , where is the eigenvalue corresponding to the eigenvector , for . Since , the eigenvalues of and are related via

 = =

Since the largest eigenvalue of an irreducible Markov chain is one (with multiplicity one) [16], we have , or equivalently

with . Moreover, we have , so that the first eigenvector corresponds to the eigenvalue . Let denote the matrix obtained from by deleting its first column, . Since the smallest eigenvalue of is zero, we may write , where , , and . With this notation, our analysis is based on the decomposition

 θ(τ) =α(τ)→1√n+˜Uβ(τ), (10)

where we have defined and . Since for all , from the decomposition (10) and the form of the updates (8), we have the following recursions,

 α(τ+1) =α(τ)−ϵ(τ)→1T√nv(τ),and (11)
 β(τ+1) =β(τ)−ϵ(τ)(L––(τ)β(τ)+˜UTv(τ)). (12)

Here is an matrix defined by the relation

 UTL(τ)U =[0→0T→0L––(τ)]n×n.

### Iv-B Main steps

As we show, part (a) of the theorem requires some intermediate results of the proof of part (b). Accordingly, we defer it to the end of the section. With this set-up, we now state the two main technical lemmas that form the core of Theorem 1. Our first lemma concerns the behavior of the component sequences and which evolve according to equations (11) and (12) respectively.

###### Lemma 2.

Given the random sequence generated by the update equation (4), we have

 MSE(θ(τ)) =1nvar(α(τ))e1(τ)+1nE[∥β(τ)∥22]e2(τ). (13)

Furthermore, and satisfy the following bounds:

1. For each iteration , we have

 e1(τ) ≤σ2δ[λ2(¯S)]2. (14)
2. Moreover, for each iteration we have

 e2(τ) ≤σ2[λ2(¯S)]2log(τ+1δ−1)τ+1δ−1+e2(0)1δ−1τ+1δ−1, (15)

From Lemma 2, we conclude that in order to guarantee an bound on the MSE, it suffices to take such that

 1δ−1τ+1δ−1≤σ2δe2(0)[λ2(¯S)]2,andlog(τ+1δ−1)τ+1δ−1≤δ.

Note that the first inequality is satisfied when . Moreover, doing a little bit of algebra, one can see that is sufficient to satisfy the second inequality. Accordingly, we take

 τ=max{2δlog1δ,e2(0)[λ2(¯S)]2σ2δ2}

outer iterations.

The last part of the proof is to bound the second smallest eigenvalue of the Laplacian matrix . The following lemma, which we prove in Section IV-D to follow, addresses this issue. Recall that denotes the second smallest eigenvalue of a matrix.

###### Lemma 3.

The averaged matrix that arises from our protocol has the following properties:

1. For a cycle and a regular grid we have , and

2. for a random geometric graph, we have , with high probability.

It is important to note that the averaged matrix is not the same as the graph Laplacian that would arise from standard averaging on these graphs. Rather, as a consequence of establishing many paths and averaging along them in each inner phase, our protocol ensures that the matrix behaves essentially like the graph Laplacian for the fully connected graph.

As established previously, each outer step requires iterations. Therefore, we have shown that it is sufficient to take a total of

 T=O(Dnmax{2δlog1δ,e2(0)[λ2(¯S)]2σ2δ2})

transmissions per edge in order to guarantee a bound on the MSE. As we will see in the next section, assuming that the initial values are fixed, we have , hence . The claims in Theorem 1 then follow by standard calculations of the diameters of the various graphs and the result of the Lemma 3.

It remains to prove the two technical results, Lemma 2 and 3, and we do so in the following sections.

### Iv-C Proof of Lemma 2

We begin by observing that

 E[(θ(τ)−¯θ→1)(θ(τ)−¯θ→1)T] =F1+F2+F3,

where , the second term is given by , and

 F3:=E[(α(τ)−√n¯θ)→1√nβ(τ)T˜UT]+E[(α(τ)−√n¯θ)˜Uβ(τ)→1T√n].

Since has orthonormal columns, all orthogonal to the all one vector (), it follows that , and .

It remains to compute . Unwrapping the recursion (11) and using the fact that initialization implies yields

 α(τ) (16)

for all . Since , , are zero mean random vectors, from equation (16) we conclude that 4 and accordingly, . Recalling the definition of the MSE (1) and combining the pieces yields the claim (13).

(a) From equation (16), it is clear that each is Gaussian with mean . It remains to bound the variance. Using the i.i.d. nature of the sequence , we have

 var(α(τ)) =E[(τ−1∑l=0ϵ(l)⟨→1√n,v(l)⟩)2] =τ−1∑l=0ϵ(l)2n⟨→1,C→1⟩ =τ−1∑l=0ϵ′(l)2⟨→1,C′→1⟩n,

where we have recalled the rescaled quantities (7). Recalling the fact that and using the Cauchy-Schwarz inequality, we have . Hence, we obtain

 var(α(τ)) ≤nσ2τ−1∑l=0ϵ′(l)2 =nσ2[λ2(¯S)]2τ−1∑l=01(1δ+l)2 ≤nσ2[λ2(¯S)]2∫∞1δ1x2dx=nσ2δ[λ2(¯S)]2;

from which rescaling by establishes the bound (14).

(b) Defining , the update equation (12) can be written as

 β(τ+1)=β(τ)−ϵ(τ)H(β(τ),v(τ)),

for . In order to upper bound , defined in (13), we need to control . Doing some algebra yields

 e2(τ+1)−e2(τ) =1nE[⟨β(τ+1)−β(τ),β(τ+1)+β(τ)⟩] =1nE[⟨−ϵ(τ)H(β(τ,v(τ))),−ϵ(τ)H(β(τ,v(τ)