# Hardness of parameter estimation in graphical models

###### Abstract

We consider the problem of learning the canonical parameters specifying an undirected graphical model (Markov random field) from the mean parameters. For graphical models representing a minimal exponential family, the canonical parameters are uniquely determined by the mean parameters, so the problem is feasible in principle. The goal of this paper is to investigate the computational feasibility of this statistical task. Our main result shows that parameter estimation is in general intractable: no algorithm can learn the canonical parameters of a generic pair-wise binary graphical model from the mean parameters in time bounded by a polynomial in the number of variables (unless RP = NP). Indeed, such a result has been believed to be true (see [1]) but no proof was known.

Our proof gives a polynomial time reduction from approximating the partition function of the hard-core model, known to be hard, to learning approximate parameters. Our reduction entails showing that the marginal polytope boundary has an inherent repulsive property, which validates an optimization procedure over the polytope that does not use any knowledge of its structure (as required by the ellipsoid method and others).

Guy Bresler^{1} David Gamarnik^{2} Devavrat Shah^{1}
Laboratory for Information and Decision Systems
Department of EECS^{1} and Sloan School of Management^{2}
Massachusetts Institute of Technology
{gbresler,gamarnik,devavrat}@mit.edu

## 1 Introduction

Graphical models are a powerful framework for succinct representation of complex high-dimensional distributions. As such, they are at the core of machine learning and artificial intelligence, and are used in a variety of applied fields including finance, signal processing, communications, biology, as well as the modeling of social and other complex networks. In this paper we focus on binary pairwise undirected graphical models, a rich class of models with wide applicability. This is a parametric family of probability distributions, and for the models we consider, the canonical parameters are uniquely determined by the vector of mean parameters, which consist of the node-wise and pairwise marginals.

Two primary statistical tasks pertaining to graphical models are inference and parameter estimation. A basic inference problem is the computation of marginals (or conditional probabilities) given the model, that is, the forward mapping $\theta \mapsto \mu$ from canonical parameters $\theta$ to mean parameters $\mu$. Conversely, the backward mapping $\mu \mapsto \theta$ corresponds to learning the canonical parameters from the mean parameters. The backward mapping is defined only for $\mu$ in the marginal polytope $\mathcal{M}$ of realizable mean parameters, and this is important in what follows. The backward mapping captures maximum likelihood estimation of parameters; the study of the statistical properties of maximum likelihood estimation for exponential families is a classical and important subject.

In this paper we are interested in the computational tractability of these statistical tasks. A basic question is whether or not these maps can be computed efficiently (namely in time polynomial in the problem size). As far as inference goes, it is well known that approximating the forward map (inference) is computationally hard in general. This was shown by Luby and Vigoda [2] for the hard-core model, a simple pairwise binary graphical model (defined in (2.1)). More recently, remarkably sharp results have been obtained, showing that computing the forward map for the hard-core model is tractable if and only if the system exhibits the correlation decay property [3, 4]. In contrast, to the best of our knowledge, no analogous hardness result exists for the backward mapping (parameter estimation), despite its seeming intractability [1].

Tangentially related hardness results have been previously obtained for the problem of learning the graph structure underlying an undirected graphical model. Bogdanov et al. [5] showed hardness of determining graph structure when there are hidden nodes, and Karger and Srebro [6] showed hardness of finding the maximum likelihood graph with a given treewidth. Computing the backward mapping, in comparison, requires estimation of the parameters when the graph is known.

Our main result, stated precisely in the next section, establishes hardness of approximating the backward mapping for the hard-core model. Thus, despite the problem being statistically feasible, it is computationally intractable.

The proof is by reduction, showing that the backward map can be used as a black box to efficiently estimate the partition function of the hard-core model. The reduction, described in Section 4, uses the variational characterization of the log-partition function as a constrained convex optimization over the marginal polytope of realizable mean parameters. The gradient of the function to be minimized is given by the backward mapping, and we use a projected gradient optimization method. Since approximating the partition function of the hard-core model is known to be computationally hard, the reduction implies hardness of approximating the backward map.

The main technical difficulty in carrying out the argument arises because the convex optimization is constrained to the marginal polytope, an intrinsically complicated object. Indeed, even determining membership (or evaluating the projection) to within a crude approximation of the polytope is NP-hard [7]. Nevertheless, we show that it is possible to do the optimization without using any knowledge of the polytope structure, as is normally required by ellipsoid, barrier, or projection methods. To this end, we prove that the polytope boundary has an inherent repulsive property that keeps the iterates inside the polytope without actually enforcing the constraint. The consequence of the boundary repulsion property is stated in Proposition 4.6 of Section 4, which is proved in Section 5.

Our reduction has a close connection to the variational approach to approximate inference [1]. There, the conjugate-dual representation of the log-partition function leads to a relaxed optimization problem defined over a tractable bound for the marginal polytope and with a simple surrogate to the entropy function. What our proof shows is that accurate approximation of the gradient of the entropy obviates the need to relax the marginal polytope.

We mention a related work of Kearns and Roughgarden [8] showing a polynomial-time reduction from inference to determining membership in the marginal polytope. Note that such a reduction does not establish hardness of parameter estimation: the empirical marginals obtained from samples are guaranteed to be in the marginal polytope, so an efficient algorithm could hypothetically exist for parameter estimation without contradicting the hardness of marginal polytope membership.

After completion of our manuscript, we learned that Montanari [9] has independently and simultaneously obtained similar results showing hardness of parameter estimation in graphical models from the mean parameters. His high-level approach is similar to ours, but the details differ substantially.

## 2 Main result

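In order to establish hardness of learning parameters from marginals for pairwise binary graphical models, we focus on a specific instance of this class of graphical models, the hard-core model. Given a graph $G = (V, E)$ (where $V = \{1, \dots, p\}$), the collection of independent set vectors $\mathcal{I}(G) \subseteq \{0,1\}^p$ consists of vectors $\sigma$ such that $\sigma_i = 0$ or $\sigma_j = 0$ (or both) for every edge $\{i, j\} \in E$. Each vector $\sigma \in \mathcal{I}(G)$ is the indicator vector of an independent set. The hard-core model assigns nonzero probability only to independent set vectors, with

$$P_\theta(\sigma) = \exp\Big(\sum_{i \in V} \theta_i \sigma_i - \Phi(\theta)\Big) \quad \text{for each } \sigma \in \mathcal{I}(G). \tag{2.1}$$

This is an exponential family with vector of sufficient statistics $\phi(\sigma) = (\sigma_i)_{i \in V} \in \{0,1\}^p$ and vector of canonical parameters $\theta = (\theta_i)_{i \in V} \in \mathbb{R}^p$. In the statistical physics literature the model is usually parameterized in terms of node-wise fugacity (or activity) $\lambda_i = e^{\theta_i}$. The log-partition function

$$\Phi(\theta) = \log \Big( \sum_{\sigma \in \mathcal{I}(G)} \exp\Big(\sum_{i \in V} \theta_i \sigma_i\Big) \Big)$$

serves to normalize the distribution; note that $\Phi(\theta)$ is finite for all $\theta \in \mathbb{R}^p$. Here and throughout, all logarithms are to the natural base.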
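To make the definitions concrete, the following minimal sketch evaluates the log-partition function and the node-wise marginals of a hard-core model by brute-force enumeration, assuming the standard form in which an independent set vector $\sigma$ has weight $\exp(\sum_i \theta_i \sigma_i)$. The function names and the three-node path example are illustrative assumptions, not part of the paper, and the enumeration takes time exponential in the number of nodes.

```python
import itertools
import math

def independent_sets(p, edges):
    """Yield all 0/1 vectors of length p in which no edge has both endpoints set."""
    for sigma in itertools.product((0, 1), repeat=p):
        if all(not (sigma[i] and sigma[j]) for i, j in edges):
            yield sigma

def log_partition(theta, edges):
    """Log of the sum over independent set vectors of exp(<theta, sigma>)."""
    p = len(theta)
    return math.log(sum(
        math.exp(sum(t * s for t, s in zip(theta, sigma)))
        for sigma in independent_sets(p, edges)
    ))

def marginals(theta, edges):
    """Mean parameters: the probability that each node is occupied."""
    p = len(theta)
    phi = log_partition(theta, edges)
    mu = [0.0] * p
    for sigma in independent_sets(p, edges):
        prob = math.exp(sum(t * s for t, s in zip(theta, sigma)) - phi)
        for i in range(p):
            mu[i] += prob * sigma[i]
    return mu

# Path graph 0 - 1 - 2 with all-zero parameters (uniform over independent sets).
edges = [(0, 1), (1, 2)]
theta = [0.0, 0.0, 0.0]
Z = math.exp(log_partition(theta, edges))
mu = marginals(theta, edges)
```

On this path there are five independent sets, so with zero parameters the partition function equals 5 and the middle node is occupied less often than the end nodes.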

The set $\mathcal{M}$ of realizable mean parameters plays a major role in the paper, and is defined as

$$\mathcal{M} = \big\{ \mu \in \mathbb{R}^p : \mathbb{E}_P[\sigma] = \mu \text{ for some probability distribution } P \text{ on the independent set vectors } \mathcal{I}(G) \big\}.$$

For the hard-core model (2.1), the set $\mathcal{M}$ is a polytope equal to the convex hull of independent set vectors $\mathcal{I}(G)$ and is called the marginal polytope. The marginal polytope’s structure can be rather complex, and one indication of this is that the number of half-space inequalities needed to represent $\mathcal{M}$ can be very large, depending on the structure of the graph $G$ underlying the model [10, 11].

The model (2.1) is a regular minimal exponential family, so for each $\mu$ in the interior $\mathcal{M}^\circ$ of the marginal polytope there corresponds a unique $\theta(\mu)$ satisfying the dual matching condition

$$\mathbb{E}_{\theta(\mu)}[\sigma] = \mu.$$

We are concerned with approximation of the backward mapping $\mu \mapsto \theta(\mu)$, and we use the following notion of approximation.

###### Definition 2.1.

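We say that $\hat{y} \in \mathbb{R}$ is a $\delta$-approximation to $y \in \mathbb{R}$ if $y(1 - \delta) \leq \hat{y} \leq y(1 + \delta)$. A vector $\hat{v} \in \mathbb{R}^p$ is a $\delta$-approximation to $v \in \mathbb{R}^p$ if each entry $\hat{v}_i$ is a $\delta$-approximation to $v_i$.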

We next define the appropriate notion of efficient approximation algorithm.

###### Definition 2.2.

A fully polynomial randomized approximation scheme (FPRAS) for a mapping is a randomized algorithm that for each and input , with probability at least outputs a -approximation to and moreover the running time is bounded by a polynomial .

Our result uses the complexity classes RP and NP, defined precisely in any complexity text (such as [12]). The class RP consists of problems solvable by efficient (randomized polynomial) algorithms, and NP consists of many seemingly difficult problems with no known efficient algorithms. It is widely believed that $\mathrm{NP} \neq \mathrm{RP}$. Assuming this, our result says that there cannot be an efficient approximation algorithm for the backward mapping in the hard-core model (and thus also for the more general class of binary pairwise graphical models).

We recall that approximating the backward mapping entails taking a vector $\mu$ as input and producing an approximation of the corresponding vector of canonical parameters $\theta$ as output. It should be noted that even determining whether a given vector $\mu$ belongs to the marginal polytope $\mathcal{M}$ is known to be an NP-hard problem [7]. However, our result shows that the problem is NP-hard even if the input vector $\mu$ is known a priori to be an element of the marginal polytope $\mathcal{M}$.

###### Theorem 2.3.

Assuming $\mathrm{NP} \neq \mathrm{RP}$, there does not exist an FPRAS for the backward mapping $\mu \mapsto \theta$.

As discussed in the introduction, Theorem 2.3 is proved by showing that the backward mapping can be used as a black box to efficiently estimate the partition function of the hard-core model, known to be hard. This uses the variational characterization of the log-partition function as well as a projected gradient optimization method. Proving validity of the projected gradient method requires overcoming a substantial technical challenge: we show that the iterates remain within the marginal polytope without explicitly enforcing this (in particular, we do not project onto the polytope). The bulk of the paper is devoted to establishing this fact, which may be of independent interest.

## 3 Background

### 3.1 Exponential families and conjugate duality

We now provide background on exponential families (as can be found in the monograph by Wainwright and Jordan [1]) specialized to the hard-core model (2.1) on a fixed graph $G = (V, E)$. General theory on conjugate duality justifying the statements of this subsection can be found in Rockafellar’s book [13].

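The basic relationship between the canonical and mean parameters is expressed via conjugate (or Fenchel) duality. The conjugate dual of the log-partition function $\Phi(\theta)$ is

$$\Phi^*(\mu) := \sup_{\theta \in \mathbb{R}^p} \big\{ \langle \mu, \theta \rangle - \Phi(\theta) \big\}.$$

Note that for our model $\Phi(\theta)$ is finite for all $\theta \in \mathbb{R}^p$ and furthermore the supremum is uniquely attained. On the interior $\mathcal{M}^\circ$ of the marginal polytope, $-\Phi^*$ is the entropy function. The log-partition function can then be expressed as

$$\Phi(\theta) = \sup_{\mu \in \mathcal{M}} \big\{ \langle \theta, \mu \rangle - \Phi^*(\mu) \big\}, \tag{3.1}$$

with

$$\mu(\theta) = \arg\max_{\mu \in \mathcal{M}} \big\{ \langle \theta, \mu \rangle - \Phi^*(\mu) \big\}. \tag{3.2}$$

The forward mapping $\theta \mapsto \mu$ is specified by the variational characterization (3.2) or alternatively by the gradient map $\nabla \Phi : \mathbb{R}^p \to \mathcal{M}$.

As mentioned earlier, for each $\mu$ in the interior $\mathcal{M}^\circ$ there is a unique $\theta(\mu)$ satisfying the dual matching condition $\mathbb{E}_{\theta(\mu)}[\sigma] = (\nabla \Phi)(\theta(\mu)) = \mu$.

For mean parameters $\mu \in \mathcal{M}^\circ$, the backward mapping $\mu \mapsto \theta(\mu)$ to the canonical parameters is given by

$$\theta(\mu) = \arg\max_{\theta \in \mathbb{R}^p} \big\{ \langle \mu, \theta \rangle - \Phi(\theta) \big\}$$

or by the gradient

$$\nabla \Phi^*(\mu) = \theta(\mu).$$

The latter representation will be the more useful one for us.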
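As a small numerical illustration of the backward mapping, the sketch below recovers canonical parameters from mean parameters on a three-node path by gradient ascent on the concave function $\theta \mapsto \langle \mu, \theta \rangle - \Phi(\theta)$, whose gradient is $\mu$ minus the marginals at $\theta$. The brute-force forward map, the function names, the step size, and the iteration count are all assumptions chosen for illustration. Because the forward map is evaluated by exhaustive enumeration, the sketch runs in time exponential in the number of nodes and says nothing about tractability.

```python
import itertools
import math

def marginals(theta, edges):
    """Forward mapping theta -> mu, computed by brute-force enumeration."""
    p = len(theta)
    weights = {}
    for sigma in itertools.product((0, 1), repeat=p):
        if all(not (sigma[i] and sigma[j]) for i, j in edges):
            weights[sigma] = math.exp(sum(t * s for t, s in zip(theta, sigma)))
    Z = sum(weights.values())
    return [sum(w * sigma[i] for sigma, w in weights.items()) / Z for i in range(p)]

def backward_map(mu, edges, step=1.0, iters=5000):
    """Gradient ascent on <mu, theta> - Phi(theta); the gradient is mu - mu(theta)."""
    theta = [0.0] * len(mu)
    for _ in range(iters):
        current = marginals(theta, edges)
        theta = [t + step * (m - c) for t, m, c in zip(theta, mu, current)]
    return theta

# Round trip on the path 0 - 1 - 2: theta -> mu -> theta_hat.
edges = [(0, 1), (1, 2)]
theta_true = [0.5, -0.3, 0.2]
mu = marginals(theta_true, edges)
theta_hat = backward_map(mu, edges)
```

The round trip recovers the original parameters up to numerical error, which reflects the uniqueness of the canonical parameters for a minimal exponential family.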

### 3.2 Hardness of inference

We describe an existing result on the hardness of inference and state the corollary we will use. The result says that, subject to widely believed conjectures in computational complexity, no efficient algorithm exists for approximating the partition function of certain hard-core models. Recall that the hard-core model with fugacity $\lambda$ is given by (2.1) with $\theta_i = \ln \lambda$ for each $i \in V$.

###### Theorem 3.1 ([3, 4]).

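Suppose $d \geq 3$ and $\lambda > \lambda_c(d) := \frac{(d-1)^{d-1}}{(d-2)^{d}}$. Assuming $\mathrm{NP} \neq \mathrm{RP}$, there exists no FPRAS for computing the partition function of the hard-core model with fugacity $\lambda$ on regular graphs of degree $d$. In particular, no FPRAS exists when $\lambda = 1$ and $d \geq 6$.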

We remark that the source of hardness is the long-range dependence property of the hard-core model for fugacity $\lambda$ above the uniqueness threshold $\lambda_c(d)$. It was shown in [14] that for $\lambda$ below this threshold the model exhibits decay of correlations and there is an FPRAS for the log-partition function (in fact there is a deterministic approximation scheme as well). We note that a number of hardness results are known for the hard-core and Ising models, including [15, 16, 3, 2, 4, 17, 18, 19]. The result stated in Theorem 3.1 suffices for our purposes.
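For orientation, the short sketch below evaluates the uniqueness threshold of the hard-core model on the infinite $d$-regular tree, using the standard formula $\lambda_c(d) = (d-1)^{d-1}/(d-2)^{d}$, and finds the smallest degree at which fugacity $\lambda = 1$ (all-zero canonical parameters) lies above the threshold. The function and variable names are illustrative assumptions.

```python
def lambda_c(d):
    """Uniqueness threshold on the infinite d-regular tree: (d-1)^(d-1) / (d-2)^d."""
    if d < 3:
        raise ValueError("the threshold is considered here only for degree d >= 3")
    return (d - 1) ** (d - 1) / (d - 2) ** d

# Thresholds for a few degrees, and the smallest degree with threshold below 1.
thresholds = {d: lambda_c(d) for d in range(3, 8)}
smallest_hard_degree = min(d for d in thresholds if thresholds[d] < 1.0)
```

The threshold decreases with the degree and first drops below 1 at degree 6, which is why the case of fugacity 1 is covered by the hardness result once the degree is at least 6.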

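From this section we will need only the following corollary, proved in the Appendix. The proof, standard in the literature, uses the self-reducibility of the hard-core model to express the partition function in terms of marginals computed on subgraphs.

###### Corollary 3.2.

Consider the hard-core model (2.1) on graphs of degree at most $d$, with $d \geq 6$, and with parameters $\theta_i = 0$ for all $i \in V$. Assuming $\mathrm{NP} \neq \mathrm{RP}$, there exists no FPRAS for the vector of marginal probabilities $\mu(0)$, where error is measured entry-wise as per Definition 2.1.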

## 4 Reduction by optimizing over the marginal polytope

In this section we describe our reduction and prove Theorem 2.3. We define polynomial constants

(4.1) |

which we will leave as , , and to clarify the calculations. Also, given the asymptotic nature of the results, we assume that is larger than a universal constant so that certain inequalities are satisfied.

###### Proposition 4.1.

Fix a graph on nodes. Let be a black box giving a -approximation for the backward mapping for the hard-core model (2.1). Using calls to , and computation bounded by a polynomial in , it is possible to produce a -approximation to the marginals corresponding to all zero parameters.

We first observe that Theorem 2.3 follows almost immediately.

###### Proof of Theorem 2.3.

A standard median amplification trick (see e.g. [20]) allows one to decrease the probability of erroneous output by an FPRAS to below using function calls. Thus the assumed FPRAS for the backward mapping can be made to give a -approximation to on successive calls, with probability of no erroneous outputs equal to at least . By taking in Proposition 4.1 we get a -approximation to with computation bounded by a polynomial in . In other words, the existence of an FPRAS for the mapping gives an FPRAS for the marginals , and by Corollary 3.2 this is not possible if $\mathrm{NP} \neq \mathrm{RP}$. ∎
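The median amplification step used in this proof can be sketched as follows for a scalar-valued estimator: repeat the randomized procedure an odd number of times and return the median, which is wrong only if at least half of the runs are wrong. The helper names and the deliberately faulty estimator below are hypothetical and serve only to illustrate the idea.

```python
import statistics

def amplify_by_median(estimator, repetitions):
    """Return the median of repeated runs of a scalar-valued estimator."""
    if repetitions % 2 == 0:
        repetitions += 1  # an odd count makes the median one of the actual outputs
    return statistics.median(estimator() for _ in range(repetitions))

# Hypothetical estimator that returns a wildly wrong value on every fourth call.
calls = {"n": 0}

def flaky_estimator():
    calls["n"] += 1
    return 100.0 if calls["n"] % 4 == 0 else 1.0

amplified = amplify_by_median(flaky_estimator, 9)
```

Here two of the nine runs fail, so the median still equals the correct value.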

We now work towards proving Proposition 4.1, the goal being to estimate the vector of marginals for some fixed graph . The desired marginals are given by the solution to the optimization (3.2) with :

(4.2) |

We know from Section 3 that for the gradient , that is, the backward mapping amounts to a first-order (gradient) oracle. A natural approach to solving the optimization problem (4.2) is to use a projected gradient method. For reasons that will become clear later, instead of projecting onto the marginal polytope , we project onto the shrunken marginal polytope defined as

(4.3) |

where is the th standard basis vector.

As mentioned before, projecting onto is NP-hard, and this must therefore be avoided if we are to obtain a polynomial-time reduction. Nevertheless, we temporarily assume that it is possible to do the projection and address this difficulty later. With this in mind, we propose to solve the optimization (4.2) by a projected gradient method with fixed step size ,

(4.4) |

In order for the method (4.4) to succeed a first requirement is that the optimum is inside . The following lemma is proved in the Appendix.

###### Lemma 4.2.

Consider the hard-core model (2.1) on a graph with maximum degree on nodes and canonical parameters . Then the corresponding vector of mean parameters is in .

One of the benefits of operating within is that the gradient is bounded by a polynomial in , and this will allow the optimization procedure to converge in a polynomial number of steps. The following lemma amounts to a rephrasing of Lemmas 5.3 and 5.4 in Section 5 and the proof is omitted.

###### Lemma 4.3.

We have the gradient bound for any .

Next, we state general conditions under which an approximate projected gradient algorithm converges quickly. Better convergence rates are possible using the strong convexity of (shown in Lemma 4.5 below), but this lemma suffices for our purposes. The proof is standard (see [21] or Theorem 3.1 in [22] for a similar statement) and is given in the Appendix for completeness.

###### Lemma 4.4 (Projected gradient method).

Let be a convex function defined over a compact convex set with minimizer . Suppose we have access to an approximate gradient oracle for with error bounded as . Let . Consider the projected gradient method starting at and with fixed step size . After iterations the average satisfies .
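A minimal sketch of a projected gradient method with an inexact gradient oracle and iterate averaging is given below on a toy problem. The quadratic objective, the box constraint, the fixed gradient bias, the step size, and the iteration count are illustrative assumptions and are not the quantities appearing in the lemma.

```python
def projected_gradient_average(grad_oracle, project, x0, step, iters):
    """Run x <- project(x - step * g(x)) and return the average of the iterates visited."""
    x = list(x0)
    running_sum = [0.0] * len(x)
    for _ in range(iters):
        # accumulate the current iterate before taking the step
        running_sum = [a + xi for a, xi in zip(running_sum, x)]
        g = grad_oracle(x)
        x = project([xi - step * gi for xi, gi in zip(x, g)])
    return [a / iters for a in running_sum]

# Toy problem: minimize 0.5 * ||x - c||^2 over the box [0, 1]^2.
c = [0.3, 1.4]

def noisy_gradient(x, bias=1e-3):
    """Exact gradient x - c plus a small fixed error, mimicking an approximate oracle."""
    return [xi - ci + bias for xi, ci in zip(x, c)]

def project_box(x):
    """Euclidean projection onto the box [0, 1]^2 (coordinate-wise clipping)."""
    return [min(1.0, max(0.0, xi)) for xi in x]

x_bar = projected_gradient_average(
    noisy_gradient, project_box, [0.0, 0.0], step=0.1, iters=2000
)
```

The constrained minimizer of this toy problem is $(0.3, 1)$, and the averaged iterate lands close to it despite the biased gradients.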

To translate accuracy in approximating the function to approximating , we use the fact that is strongly convex. The proof (in the Appendix) uses the equivalence between strong convexity of and strong smoothness of the Fenchel dual , the latter being easy to check. Since we only require the implication of the lemma, we defer the definitions of strong convexity and strong smoothness to the appendix where they are used.

###### Lemma 4.5.

The function is -strongly convex. As a consequence, if for and , then .

At this point all the ingredients are in place to show that the updates (4.4) rapidly approach , but a crucial difficulty remains to be overcome. The assumed black box for approximating the mapping is only defined for inside , and thus it is not at all obvious how to evaluate the projection onto the closely related polytope . Indeed, as shown in [7], even approximate projection onto is NP-hard, and no polynomial time reduction can require projecting onto (assuming ).

The goal of the subsequent Section 5 is to prove Proposition 4.6 below, which states that the optimization procedure can be carried out without any knowledge about or . Specifically, we show that thresholding coordinates suffices, that is, instead of projecting onto we may project onto the translated non-negative orthant . Writing for this projection, we show that the original projected gradient method (4.4) produces the same iterates as the much simpler update rule

(4.5) |

###### Proposition 4.6.

Choose constants as per (4.1). Suppose , and consider the iterates for , where is a -approximation of for all . Then , for all , and thus the iterates are the same using either or .
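The thresholded update can be illustrated on the smallest nontrivial example, a single edge, where the backward mapping has a closed form: the independent sets are 00, 10 and 01, so each canonical parameter equals $\log \mu_i - \log(1 - \mu_1 - \mu_2)$. The sketch below takes gradient steps using this exact backward mapping and only thresholds coordinates from below, with no projection onto the polytope. The starting point, step size, threshold, and iteration count are illustrative assumptions and are not the constants of (4.1).

```python
import math

def backward_map_single_edge(x):
    """Exact canonical parameters for the hard-core model on one edge (two nodes)."""
    slack = 1.0 - x[0] - x[1]
    if min(x) <= 0.0 or slack <= 0.0:
        raise ValueError("point is outside the interior of the marginal polytope")
    return [math.log(xi) - math.log(slack) for xi in x]

def thresholded_descent(x0, step, floor, iters):
    """Iterate x <- max(x - step * theta(x), floor), applied coordinate-wise."""
    x = list(x0)
    for _ in range(iters):
        theta = backward_map_single_edge(x)  # raises if an iterate leaves the polytope
        x = [max(xi - step * ti, floor) for xi, ti in zip(x, theta)]
    return x

x_final = thresholded_descent(x0=[0.25, 0.25], step=0.05, floor=0.01, iters=300)
```

In this toy run the iterates never leave the polytope and converge to $(1/3, 1/3)$, the marginals of the single-edge model with all-zero parameters. The canonical parameters grow without bound as a point approaches the face $\mu_1 + \mu_2 = 1$, which is the kind of repulsion the proposition exploits.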

The next section is devoted to the proof of Proposition 4.6. We now complete the reduction.

###### Proof of Proposition 4.1.

We start the gradient update procedure at the point , which we claim is within for any graph for large enough. To see this, note that is in , because it is a convex combination (with weight each) of the independent set vectors . Hence , and additionally , for all .

We establish that for each by induction, having verified the base case in the preceding paragraph. Let for some . At iteration of the update rule we make a call to the black box giving a -approximation to the backward mapping , compute , and then project onto . Proposition 4.6 ensures that . Therefore, the update is the same as .

Lemma 4.5 implies that , and since , we get the entry-wise bound for each . Hence is a -approximation for . ∎

## 5 Proof of Proposition 4.6

In Subsection 5.1 we prove estimates on the parameters corresponding to close to the boundary of , and then in Subsection 5.2 we use these estimates to show that the boundary of has a certain repulsive property that keeps the iterates inside.

### 5.1 Bounds on gradient

We start by introducing some helpful notation. For a node , let denote its neighbors. We partition the collection of independent set vectors as

where

For a collection of independent set vectors we write as shorthand for and

We can then write the marginal at node as , and since partition , the space of all independent sets of , . For each let

The following lemma specifies a condition on and that implies a lower bound on .

###### Lemma 5.1.

If and for , then .

###### Proof.

Let , and observe that . We want to show that .

We now use the preceding lemma to show that if a coordinate is close to the boundary of the shrunken marginal polytope , then the corresponding parameter is large.

###### Lemma 5.2.

Let be a positive real number. If and , then .

###### Proof.

We would like to apply Lemma 5.1 with and , which requires showing that (a) and (b) . To show (a), note that if , then by definition of . It follows that .

We now show (b). Since , , and , (b) is equivalent to . We assume that and suppose for the sake of contradiction that . Writing for , so that , we define a new probability measure

One can check that has for each and . The point , being a convex combination of independent set vectors, must be in , and hence so must . But this contradicts the hypothesis and completes the proof of the lemma. ∎

The next two lemmas have proofs similar in spirit to that of Lemma 8 in [23]; both are proved in the Appendix. The first lemma gives an upper bound on the parameters corresponding to an arbitrary point in .

###### Lemma 5.3.

If , then . Hence if , then for all .

The next lemma shows that if a component is not too small, the corresponding parameter is also not too negative. As before, this allows us to bound from below the parameters corresponding to an arbitrary point in .

###### Lemma 5.4.

If , then . Hence if , then for all .

### 5.2 Finishing the proof of Proposition 4.6

We sketch the remainder of the proof here; full detail is given in Section D of the Supplement.

Starting with an arbitrary in , our goal is to show that remains in . The proof will then follow by induction, because our initial point is in by the hypothesis.

The argument considers separately each hyperplane constraint for of the form . The distance of from the hyperplane is . Now, the definition of implies that if , then for all coordinates , and thus for all constraints. We call a constraint critical if , and active if . For there are no critical constraints, but there may be active constraints.

We first show that inactive constraints can at worst become active for the next iterate , which requires only that the step-size is not too large relative to the magnitude of the gradient (Lemma 4.3 gives the desired bound). Then we show (using the gradient estimates from Lemmas 5.2, 5.3, and 5.4) that the active constraints have a repulsive property and that is no closer than to any active constraint, that is, . The argument requires care, because the projection may prevent coordinates from decreasing despite being very negative if is already small. These arguments together show that remains in , completing the proof.

## 6 Discussion

This paper addresses the computational tractability of parameter estimation for the hard-core model. Our main result shows hardness of approximating the backward mapping to within a small polynomial factor. This is a fairly stringent form of approximation, and it would be interesting to strengthen the result to show hardness even for a weaker form of approximation. A possible goal would be to show that there exists a universal constant such that approximation of the backward mapping to within a factor in each coordinate is NP-hard.

#### Acknowledgments

GB thanks Sahand Negahban for helpful discussions. Also we thank Andrea Montanari for sharing his unpublished manuscript [9]. This work was supported in part by NSF grants CMMI-1335155 and CNS-1161964, and by Army Research Office MURI Award W911NF-11-1-0036.

## References

- [1] M. Wainwright and M. Jordan, “Graphical models, exponential families, and variational inference,” Foundations and Trends in Machine Learning, vol. 1, no. 1-2, pp. 1–305, 2008.
- [2] M. Luby and E. Vigoda, “Fast convergence of the Glauber dynamics for sampling independent sets,” Random Structures and Algorithms, vol. 15, no. 3-4, pp. 229–241, 1999.
- [3] A. Sly and N. Sun, “The computational hardness of counting in two-spin models on d-regular graphs,” in FOCS, pp. 361–369, IEEE, 2012.
- [4] A. Galanis, D. Stefankovic, and E. Vigoda, “Inapproximability of the partition function for the antiferromagnetic Ising and hard-core models,” arXiv preprint arXiv:1203.2226, 2012.
- [5] A. Bogdanov, E. Mossel, and S. Vadhan, “The complexity of distinguishing Markov random fields,” Approximation, Randomization and Combinatorial Optimization, pp. 331–342, 2008.
- [6] D. Karger and N. Srebro, “Learning Markov networks: Maximum bounded tree-width graphs,” in Symposium on Discrete Algorithms (SODA), pp. 392–401, 2001.
- [7] D. Shah, D. N. Tse, and J. N. Tsitsiklis, “Hardness of low delay network scheduling,” Information Theory, IEEE Transactions on, vol. 57, no. 12, pp. 7810–7817, 2011.
- [8] T. Roughgarden and M. Kearns, “Marginals-to-models reducibility,” in Advances in Neural Information Processing Systems, pp. 1043–1051, 2013.
- [9] A. Montanari, “Computational implications of reducing data to sufficient statistics.” unpublished, 2014.
- [10] M. Deza and M. Laurent, Geometry of cuts and metrics. Springer, 1997.
- [11] G. M. Ziegler, “Lectures on 0/1-polytopes,” in Polytopes—combinatorics and computation, pp. 1–41, Springer, 2000.
- [12] C. H. Papadimitriou, Computational complexity. John Wiley and Sons Ltd., 2003.
- [13] R. T. Rockafellar, Convex analysis, vol. 28. Princeton university press, 1997.
- [14] D. Weitz, “Counting independent sets up to the tree threshold,” in Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, pp. 140–149, ACM, 2006.
- [15] M. Dyer, A. Frieze, and M. Jerrum, “On counting independent sets in sparse graphs,” SIAM Journal on Computing, vol. 31, no. 5, pp. 1527–1541, 2002.
- [16] A. Sly, “Computational transition at the uniqueness threshold,” in FOCS, pp. 287–296, 2010.
- [17] F. Jaeger, D. Vertigan, and D. Welsh, “On the computational complexity of the Jones and Tutte polynomials,” Math. Proc. Cambridge Philos. Soc, vol. 108, no. 1, pp. 35–53, 1990.
- [18] M. Jerrum and A. Sinclair, “Polynomial-time approximation algorithms for the Ising model,” SIAM Journal on computing, vol. 22, no. 5, pp. 1087–1116, 1993.
- [19] S. Istrail, “Statistical mechanics, three-dimensionality and NP-completeness: I. universality of intractability for the partition function of the Ising model across non-planar surfaces,” in STOC, pp. 87–96, ACM, 2000.
- [20] M. R. Jerrum, L. G. Valiant, and V. V. Vazirani, “Random generation of combinatorial structures from a uniform distribution,” Theoretical Computer Science, vol. 43, pp. 169–188, 1986.
- [21] Y. Nesterov, Introductory lectures on convex optimization: A basic course, vol. 87. Springer, 2004.
- [22] S. Bubeck, “Theory of convex optimization for machine learning.” Available at http://www.princeton.edu/~sbubeck/pub.html.
- [23] L. Jiang, D. Shah, J. Shin, and J. Walrand, “Distributed random access algorithm: scheduling and congestion control,” IEEE Trans. on Info. Theory, vol. 56, no. 12, pp. 6182–6207, 2010.
- [24] D. P. Bertsekas, Nonlinear programming. Athena Scientific, 1999.
- [25] S. M. Kakade, S. Shalev-Shwartz, and A. Tewari, “Regularization techniques for learning with matrices,” J. Mach. Learn. Res., vol. 13, pp. 1865–1890, June 2012.
- [26] J. M. Borwein and J. D. Vanderwerff, Convex functions: constructions, characterizations and counterexamples. No. 109, Cambridge University Press, 2010.

## Supplementary Material

## Appendix A Miscellaneous proofs

### A.1 Proof of Corollary 3.2

The proof is standard and uses the self-reducibility of the hard-core model, meaning that conditioning on amounts to removing node from the graph. Fix a graph and parameters . We show that given an algorithm to approximately compute the marginals for induced subgraphs , it is possible to approximate the partition function , denoted here by . We first claim that

(A.1) |

The graph is obtained by removing nodes labeled , and is the marginal at node for this graph. We use induction on the number of nodes. The base case with one node is trivial: . Suppose now that the formula (A.1) holds for graphs on nodes and that . Let and denote the partition function summation restricted to or , respectively. Thus

Now is the partition function of a new graph obtained by deleting vertex , and the inductive assumption proves the formula.

From (A.1) we see that in order to compute a -approximation to , it suffices to compute a approximation to each of the marginals. Now for small , a approximation to gives a approximation to , and this completes the proof.
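The self-reducibility argument can be checked numerically on a small graph. The sketch below assumes all-zero parameters, so that the partition function counts independent sets, and rebuilds it as a telescoping product of the factors $1 / P(\sigma_i = 0)$, where each probability is a marginal on the induced subgraph that remains after the earlier nodes have been removed. The brute-force marginal oracle and all function names are illustrative stand-ins for the approximate inference algorithm assumed in the proof.

```python
import itertools

def independent_sets(nodes, edges):
    """All independent set vectors of the subgraph induced by `nodes`."""
    nodes = list(nodes)
    pos = {v: k for k, v in enumerate(nodes)}
    sub_edges = [(pos[i], pos[j]) for i, j in edges if i in pos and j in pos]
    return [
        s for s in itertools.product((0, 1), repeat=len(nodes))
        if all(not (s[a] and s[b]) for a, b in sub_edges)
    ]

def marginal_oracle(v, nodes, edges):
    """Stand-in inference oracle: P(sigma_v = 1) at zero parameters on the induced subgraph."""
    nodes = list(nodes)
    sets = independent_sets(nodes, edges)
    k = nodes.index(v)
    return sum(s[k] for s in sets) / len(sets)

def partition_from_marginals(nodes, edges):
    """Telescoping product of 1 / (1 - marginal), removing one node at a time."""
    nodes = list(nodes)
    Z = 1.0
    while nodes:
        v = nodes[0]
        Z /= 1.0 - marginal_oracle(v, nodes, edges)
        nodes = nodes[1:]  # conditioning on sigma_v = 0 removes node v
    return Z

# Four-cycle: the independent sets are the empty set, 4 singletons and 2 diagonals.
cycle_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
Z_cycle = partition_from_marginals(range(4), cycle_edges)
```

For the four-cycle the product of the four factors is $\tfrac{7}{5} \cdot \tfrac{5}{3} \cdot \tfrac{3}{2} \cdot 2 = 7$, matching the number of independent sets.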

### A.2 Proof of Lemma 4.2

We wish to show that for a graph of maximum degree and . Consider a particular node with neighbors , and let denote its degree. We use the notation defined in Subsection 5.1. A collection of independent set vectors is assigned probability for our choice , so it suffices to argue about cardinalities.

We first claim that . This follows by observing that each set in gets mapped to a set in by removing the neighbors , and moreover at most sets are mapped to the same set in . Next, we note that since the removal of node is a bijection from to and hence they are of the same cardinality. Combining these observations with the fact that , we get the estimate .

Next, we show for each coordinate that the vector is in , which will complete the proof that is . Let denote the probability assigned to under the distribution with parameters , so that . Similarly to the proof of Lemma 5.2, we define a new probability measure

This is a valid probability distribution because for . One can check that has for each and . The point , being a convex combination of independent set vectors, must be in , and hence so must .

## Appendix B Proofs for projected gradient method

### B.1 Proof of Lemma 4.4

The proof here is a slight modification of the proof of Theorem 3.1 in [22].

Observe first that if is the projection onto a convex set, then is a contraction: (cf. Prop 2.1.3 in [24]). Using the convexity inequality , the definition , and the update formula , it follows that

Adding the preceding inequality for to , the sum telescopes and we get

(B.1) |

Here we used the definitions and and the last equality is by the choice . Now defining , dividing (B.1) through by and using the convexity of to apply Jensen’s inequality gives

Thus in order to make the right hand side smaller than it suffices to take and .

### B.2 Proof of Lemma 4.5

We start by showing that the gradient is -Lipschitz. Recall that . We prove a bound on by changing one coordinate of at a time. Let . The triangle inequality gives

A direct calculation shows that

Since this is uniformly bounded by one in absolute value, we obtain the inequality or

Hence

i.e., is -Lipschitz.

Now the function being -Lipschitz implies that is -strongly smooth, where is -strongly smooth if

To see this, we write

## Appendix C Proofs of gradient bounds

### C.1 Proof of Lemma 5.3

We suppose for the sake of deriving a contradiction that . Let , and let be a probability measure such that . Now , and we define the non-negative measure (summing to less than one) with support as

In this way, and . We define a new probability measure

(C.1) |

and one may check that and . We use the definitions in Subsection 5.1 to get