Minimization Problems Based on
Relative α-Entropy I: Forward Projection
Abstract
Minimization problems with respect to a one-parameter family of generalized relative entropies are studied. These relative entropies, which we term relative α-entropies (denoted I_α), arise as redundancies under mismatched compression when cumulants of compressed lengths are considered instead of expected compressed lengths. These parametric relative entropies are a generalization of the usual relative entropy (Kullback–Leibler divergence). Just like relative entropy, these relative α-entropies behave like squared Euclidean distance and satisfy the Pythagorean property. Minimizers of these relative α-entropies on closed and convex sets are shown to exist. Such minimizations generalize the maximum Rényi or Tsallis entropy principle. The minimizing probability distribution (termed forward projection) for a linear family is shown to obey a power-law. Other results in connection with statistical inference, namely subspace transitivity and iterated projections, are also established. In a companion paper, a related minimization problem of interest in robust statistics that leads to a reverse projection is studied.
I Introduction
Relative entropy^{1}^{1}1The relative entropy of P with respect to Q is defined as D(P‖Q) = Σ_x P(x) log(P(x)/Q(x)), and the Shannon entropy of P is defined as H(P) = −Σ_x P(x) log P(x). The usual convention is 0 log(0/q) = 0 and p log(p/0) = +∞ if p > 0. or Kullback–Leibler divergence between two probability measures is a fundamental quantity that arises in a variety of situations in probability theory, statistics, and information theory. In probability theory, it arises as the rate function for estimating the probability of a large deviation for the empirical measure of independent samplings. In statistics, for example, it arises as the best error exponent in deciding between two hypothetical distributions for observed data. In Shannon theory, it is the penalty in expected compressed length, namely the gap from the Shannon entropy H(P), when the compressor assumes (for a finite-alphabet source) a mismatched probability measure Q instead of the true probability measure P.
Relative entropy also brings statistics and probability theory together to provide a foundation for the well-known maximum entropy principle for decision making under uncertainty. This is an idea that goes back to L. Boltzmann, was popularized by E. T. Jaynes [3], and has its foundation in the theory of large deviations. Suppose that an ensemble average measurement (say sample mean, sample second moment, or any other similar linear statistic) is made on the realization of a sequence of independent and identically distributed (i.i.d.) random variables. The realization must then have an empirical measure that obeys the constraint placed by the measurement – the empirical measure must belong to an appropriate convex set, say E. Large deviation theory tells us that a special member of E, denoted P*, is overwhelmingly more likely than the others. If the alphabet X is finite (with cardinality |X|), and the prior probability (before measurement) is the uniform measure U on X, then P* is the one that minimizes the relative entropy D(P‖U),
which is the same as the one that maximizes the (Shannon) entropy H(P), subject to P ∈ E. This explains why the principle is called the maximum entropy principle. In Jaynes’ words, “… it is maximally noncommittal to the missing information” [3].
As a physical example, let us tag a particular molecule in the atmosphere. Let H denote the height of the molecule in the atmosphere. Then the potential energy of the molecule is mgH, where m is its mass and g is the acceleration due to gravity. Let us suppose that the average potential energy is held constant, that is, E[mgH] = c, a constant. Then the probability distribution of the height of the molecule is taken to be the exponential distribution p(h) = λ exp(−λh), h ≥ 0, where λ = mg/c. This is also the maximum entropy probability distribution subject to the first moment constraint [4].
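The maximum entropy principle described above can be checked numerically on a toy discretization of this example. The sketch below is our own illustration, not from the paper: the three-level height grid, the target mean, and the tolerances are arbitrary choices. It solves for the Gibbs (exponential-family) distribution matching a mean constraint by bisection, and confirms it maximizes Shannon entropy among all pmfs satisfying the same constraint.

```python
import math

def shannon_entropy(p):
    """Shannon entropy in nats; terms with p(x) = 0 contribute 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

# Toy version of the atmosphere example: heights h in {0, 1, 2} (arbitrary
# units), with the mean height constrained to 0.5.
heights = [0, 1, 2]
target_mean = 0.5

def gibbs(lam):
    """Exponential/Gibbs pmf p(h) proportional to exp(-lam * h)."""
    w = [math.exp(-lam * h) for h in heights]
    z = sum(w)
    return [x / z for x in w]

# Solve for lam by bisection so that the mean-height constraint holds
# (the mean is strictly decreasing in lam).
lo, hi = 0.0, 50.0
for _ in range(200):
    mid = (lo + hi) / 2
    mean = sum(h * p for h, p in zip(heights, gibbs(mid)))
    if mean < target_mean:
        hi = mid
    else:
        lo = mid
p_star = gibbs((lo + hi) / 2)

# Every pmf on {0, 1, 2} with mean 0.5 has the form (0.5 + t, 0.5 - 2t, t),
# t in [0, 0.25]; the Gibbs pmf should maximize entropy over this segment.
def feasible(t):
    return [0.5 + t, 0.5 - 2 * t, t]

grid_best = max(shannon_entropy(feasible(k * 0.25 / 1000)) for k in range(1001))
assert shannon_entropy(p_star) >= grid_best - 1e-4
```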
More generally, if the prior probability (before measurement) is Q, then P* minimizes D(P‖Q) subject to P ∈ E. Something more specific can be said: P* is the limiting conditional distribution of a “tagged” particle under the conditioning imposed by the measurement. This is called the conditional limit theorem or the Gibbs conditioning principle; see for example Campenhout and Cover [5] or Csiszár [6] for a more general result.
It is well-known that relative entropy behaves like “squared Euclidean distance” and has the “Pythagorean property” (Csiszár [7]). In view of this and since P* minimizes D(P‖Q) subject to P ∈ E, one says that P* is “closest” to Q in the relative entropy sense amongst the measures in E, or in other words, “P* is the forward projection of Q on E”. Motivated by the above maximum entropy and Gibbs conditioning principles, this projection was extensively studied by Csiszár [6], [7], Csiszár and Matúš [8], Csiszár and Shields [9], and Csiszár and Tusnády [10]. More recently, minimizations of general entropy functionals with convex integrands were studied by Csiszár and Matúš [11]. These include Bregman’s divergences and Csiszár’s divergences. Relative entropy minimization also arises in the contraction principle in large deviation theory (see for example Dembo and Zeitouni [12, p. 126]).
This paper is on projections or minimization problems associated with a parametric generalization of relative entropy. To see how this parametric generalization arises, we return to our remark on how relative entropy arises in Shannon theory. For this, we must first recall how Rényi entropies are a parametric generalization of the Shannon entropy.
Rényi entropies H_α(P) for α ≠ 1 play the role of Shannon entropy when the normalized cumulant of compression length is considered instead of expected compression length. Campbell [13] showed that
lim_{n→∞} min (1/(nρ)) log E[ exp( ρ L_n(X₁, …, X_n) ) ] = H_α(P)
for an i.i.d. source with marginal P. The minimum is over all compression strategies that satisfy the Kraft inequality^{2}^{2}2A compression strategy assigns a target codeword length L_n(x₁, …, x_n) to each string (x₁, …, x_n)., Σ exp(−L_n(x₁, …, x_n)) ≤ 1, and ρ > 0 is the cumulant parameter, with α = 1/(1+ρ). We also have H_α(P) → H(P) as α → 1, so that Rényi entropy may be viewed as a generalization of Shannon entropy.
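Campbell's result can be checked numerically in an idealized single-letter setting. The sketch below is our own illustration, not the paper's: it uses real-valued code lengths in nats and ignores integer-length constraints. It verifies that lengths ℓ(x) = −log p̃(x), with p̃ the escort distribution of order 1/(1+ρ), achieve a normalized cumulant equal to the Rényi entropy of that order, and that Shannon lengths do no better.

```python
import math

def renyi_entropy(p, beta):
    """Rényi entropy of order beta != 1, in nats."""
    return math.log(sum(x ** beta for x in p)) / (1 - beta)

def cumulant_rate(p, lengths, rho):
    """Normalized cumulant (1/rho) log E[exp(rho * L)] of the code lengths."""
    return math.log(sum(px * math.exp(rho * l) for px, l in zip(p, lengths))) / rho

p = [0.5, 0.25, 0.125, 0.125]
rho = 2.0
beta = 1 / (1 + rho)

# Idealized optimal lengths l(x) = -log ptilde(x), where ptilde is the
# escort distribution p^beta / sum(p^beta).
z = sum(x ** beta for x in p)
opt_lengths = [-math.log(x ** beta / z) for x in p]

# The optimal normalized cumulant equals the Rényi entropy of order 1/(1+rho)...
assert abs(cumulant_rate(p, opt_lengths, rho) - renyi_entropy(p, beta)) < 1e-12

# ...and the (Kraft-tight) Shannon lengths -log p(x) do no better.
shannon_lengths = [-math.log(x) for x in p]
assert cumulant_rate(p, shannon_lengths, rho) >= renyi_entropy(p, beta) - 1e-12
```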
If the compressor assumed that the true probability measure is Q, instead of P, then the gap in the normalized cumulant’s optimal value is an analogous parametric divergence quantity^{3}^{3}3Blumer and McEliece [14], in their attempt to find better upper and lower bounds on the redundancy of generalized Huffman coding, were indirectly bounding this parametrized divergence., which we shall denote I_α(P,Q) [15]. The same quantity^{4}^{4}4We suggest the pronunciation “I-alpha” for I_α. also arises when we study the gap from optimality of mismatched guessing exponents. See Arikan [16] and Hanawal and Sundaresan [17] for general results on guessing, and see Sundaresan [18], [15] on how I_α arises in the context of mismatched guessing. Recently, Bunte and Lapidoth [19] have shown that I_α also arises as a redundancy in a mismatched version of the problem of coding for tasks.
As one might expect, it is known (see for example, Sundaresan [15, Sec. V] or Johnson and Vignat [20, A.1]) that I_α(P,Q) → D(P‖Q) as α → 1, so that we may think of relative entropy as I₁. Thus I_α is a generalization of relative entropy, i.e., a relative α-entropy^{5}^{5}5This terminology is from Lutwak, et al. [21].
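On a finite alphabet, I_α can be computed directly from its sum form. The following sketch is our own numerical illustration (example pmfs and tolerances are arbitrary choices): it checks positivity, the value zero at P = Q, and the α → 1 limit to the Kullback–Leibler divergence on one example.

```python
import math

def I_alpha(p, q, alpha):
    """Relative alpha-entropy on a finite alphabet (nats), alpha != 1.
    Assumes strictly positive pmfs to avoid boundary conventions."""
    a = alpha
    return ((a / (1 - a)) * math.log(sum(px * qx ** (a - 1) for px, qx in zip(p, q)))
            + (1 / (a - 1)) * math.log(sum(px ** a for px in p))
            + math.log(sum(qx ** a for qx in q)))

def kl(p, q):
    """Kullback-Leibler divergence D(P || Q) in nats."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q))

p = [0.6, 0.3, 0.1]
q = [0.2, 0.5, 0.3]

# Positivity, with value 0 at P = Q.
assert I_alpha(p, q, 0.5) > 0 and abs(I_alpha(p, p, 0.5)) < 1e-12

# I_alpha approaches the Kullback-Leibler divergence as alpha -> 1.
assert abs(I_alpha(p, q, 0.999) - kl(p, q)) < 1e-2
assert abs(I_alpha(p, q, 1.001) - kl(p, q)) < 1e-2
```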
Not surprisingly, the maximum Rényi entropy principle has been considered as a natural alternative to the maximum entropy principle of decision making under uncertainty. This principle is equivalent to another principle of maximizing the so-called Tsallis entropy, which happens to be a monotone function of the Rényi entropy. Rényi entropy maximizers under moment constraints are distributions with a power-law decay (when α < 1). See Costa et al. [22] or Johnson and Vignat [20]. Many statistical physicists have studied this principle in the hope that it may “explain” the emergence of power-laws in many naturally occurring physical and socioeconomic systems, beginning with Tsallis [23]. Based on our explorations of the vast literature on this topic, we feel that our understanding, particularly one that ought to involve a modeling of the dynamics of such systems with the observed power-law profiles as equilibria in the asymptotics of large time, is not yet as mature as our understanding of the classical Boltzmann–Gibbs setting. But, by noting that I_α(P, U) = log |X| − H_α(P), we see that both the maximum Rényi entropy principle and the maximum Tsallis entropy principle are particular instances of a “minimum relative α-entropy principle”: minimize I_α(P, Q) over P ∈ E.
We shall call the minimizing P* the forward projection of Q on E.
The main aim of this paper is to study forward projections in general measure spaces. Our main contributions are on existence, uniqueness, and structure of these projections. We have several motivations to publish our work.

We provide a rather general sufficient condition on the constraint set under which a forward projection exists and is unique. This can enable statistical physicists to speak of the Rényi entropy maximizer and explore its properties even if the maximizer is not known explicitly. While the existence and uniqueness of the forward projection for closed convex sets was shown for the finite alphabet case by Sundaresan [15], here we study more general measure spaces (for example ℝ^d).

Unlike relative entropy, its generalization relative α-entropy does not, in general, satisfy the well-known data processing inequality, nor is it in general convex in either of its arguments. Nevertheless, there is a remarkable parallelism between relative entropy and relative α-entropy. In particular, they share the “Pythagorean property” and behave like squared Euclidean distance. This too was explored by Sundaresan [15] for the finite alphabet case, and we wish to extend the parallels to more general alphabet spaces.

We provide information on the structure of the Rényi entropy maximizer, under linear statistical constraints, whenever the maximizer exists. This can provide statistical physicists a quick means to check if their empirical observations in a particular physical setting conform to the maximum Rényi entropy principle. It also provides a means to estimate the appropriate α for a particular physical setting. Interestingly, the Rényi entropy maximizers belong to a “power-law family” of distributions that are the natural parametric generalizations of the Shannon entropy maximizers, namely the exponential family of distributions.

In a companion paper, we shall show that a robust parameter estimation problem is a “reverse projection” problem, where the minimization is with respect to the second argument of I_α. If this reverse projection is on a power-law family, then one may turn the reverse projection into a forward projection of a specific distribution on an appropriate linear family. In that paper we shall also explore the geometric relationship between the power-law and the linear families.

One may think of the maximum entropy principle or the minimization of relative entropy as a “projection rule”; see Section VI for projection rules with some desired properties. Three of these properties are “regularity”, “locality”, and “subspace-transitivity”. It turns out that the I_α-based projection rule is regular and subspace-transitive (as we show in Section VI), but “nonlocal”. Any regular, subspace-transitive, and local projection rule is generated by Bregman’s divergences of the sum form [24]. In our, as yet not very successful, attempt to characterize all regular, subspace-transitive, but possibly nonlocal projection rules, we wished to understand as much as we could about a particular nonlocal projection rule. The understanding we have gained may be of use to the wider community interested in axiomatic approaches to abstract inference problems.
It is known (see for example [15]) that I_α(P,Q) is the more commonly studied Rényi divergence of order 1/α, not of the original measures P and Q, but of their escort measures P̃ and Q̃, where dP̃/dμ = p^α/‖p‖^α, and ‖p‖^α is the normalization that makes P̃ a probability measure. Q̃ is similarly defined. While the Rényi divergences arise naturally in hypothesis testing problems (see for example Csiszár [25]), I_α arises more naturally as a redundancy for mismatched compression, as discussed earlier. Moreover, I_α is a certain monotone function of a Csiszár f-divergence between P̃ and Q̃. As a consequence of the appearance of the escort measures, the data-processing property satisfied by the f-divergences does not hold for the I_α divergences. It is therefore all the more intriguing that it is neither the f-divergences nor the Rényi divergences but the I_α divergences that share the Pythagorean property with relative entropy. However, quite recently, van Erven and Harremoës [26] showed that Rényi divergences have a Pythagorean property when the forward projection is on a so-called α-convex set.
The paper is organized as follows. In Section II, we formally define I_α and establish some of its basic algebraic and topological properties, those desired of an information divergence. In Section III, we establish the existence of the forward I_α-projection on closed (in an appropriate topology) and convex sets. The proof for the case α < 1 is analogous to that for relative entropy [7, Th. 2.1]. The proof for the case α > 1 exploits some functional analytic tools. In Section IV, we present the Pythagorean property in generality and derive some of its immediate consequences in connection with the forward projection. In Section V, we characterize the forward projection on a linear family of probability measures, whenever it exists. In Section VI, we establish a desirable subspace transitivity property and further prove the convergence of an iterative method for finding the forward projection on linear families. In the concluding Section VII, we highlight some interesting open questions.
The companion paper [27] will explore the orthogonality between the power-law and the linear families, will exploit this orthogonality in a robust parameter estimation problem, and will study the reverse projection in detail.
II The Relative α-Entropy
We begin by defining relative α-entropy on a general measure space for all α > 0 except α = 1. As α → 1, our definition will approach the usual relative entropy or Kullback–Leibler divergence.
Let P and Q be two probability measures on a measure space (X, 𝒳). Let α ∈ (0, ∞) with α ≠ 1. Let μ be a dominating σ-finite measure on (X, 𝒳) with respect to which P and Q are both absolutely continuous, denoted P ≪ μ and Q ≪ μ. Write p = dP/dμ and q = dQ/dμ, and assume that p and q belong to the complete topological vector space L^α(μ) with metric d_α(f, g) = ‖f − g‖ for α > 1 and d_α(f, g) = ∫ |f − g|^α dμ for α < 1.
We shall use the notation
‖h‖ := ( ∫ |h|^α dμ )^{1/α},
even though ‖·‖, as defined, is not a norm for α < 1. For convenience we suppress the dependence of p and q on μ; but this dependence should be borne in mind. Throughout we shall restrict attention to probability measures whose densities with respect to μ are in L^α(μ). The Rényi entropy of P of order α (with respect to μ) is defined to be
H_α(P) = (1/(1−α)) log ∫ p^α dμ = (α/(1−α)) log ‖p‖.   (1)
Consider the escort measures P̃ and Q̃ having densities p̃ and q̃ with respect to μ defined by
p̃ = p^α / ∫ p^α dμ,   q̃ = q^α / ∫ q^α dμ.   (2)
Once again, the dependence of p̃ and q̃ on μ is suppressed for convenience. By setting α′ = 1/α, we have the reparametrization in terms of α′ with α′ ≠ 1, p̃, and q̃. Define the convex function f(u) = sgn(α′ − 1) · u^{α′}.
Csiszár’s f-divergence [28] between two probability measures P̃ and Q̃, both absolutely continuous with respect to μ, is given by
D_f(P̃‖Q̃) = ∫ q̃ f( p̃/q̃ ) dμ.   (3)
In the above definition we use the following conventions:
f(0) = lim_{u↓0} f(u), 0 · f(0/0) = 0, and for a > 0,
0 · f(a/0) = a · lim_{u→∞} f(u)/u.
Since f is strictly convex when α′ ≠ 1, by Jensen’s inequality, D_f(P̃‖Q̃) ≥ f(1), with equality if and only if P̃ = Q̃.
Definition 1 (Relative α-entropy)
The α-entropy of P relative to Q (or relative α-entropy of P with respect to Q, or simply relative α-entropy) is defined as
I_α(P,Q) = (α/(1−α)) log ∫ (p/‖p‖) (q/‖q‖)^{α−1} dμ.   (4)
I_α(P,Q) depends on the reference measure μ because the densities p̃ and q̃ defined in (2) do. However, for brevity, we omit the superscript μ and ask the reader to bear the dependence on μ in mind. For the information-theoretic and statistical physics motivating examples in Section I, μ is the counting measure or the Lebesgue measure depending on whether X is finite or ℝ^d.
From the conventions used to define D_f, we have I_α(P,Q) = +∞ when either

α < 1 and P is not absolutely continuous with respect to Q, or

α > 1 and P and Q are mutually singular.
Abusing notation a little, when speaking of densities, we shall sometimes write I_α(p, q) for I_α(P, Q). Let us re-emphasize that implicit in our definition of I_α is the assumption that p and q are both in L^α(μ).
The following are some alternative expressions of I_α(P,Q) that are used in this paper:
I_α(P,Q) = (α/(1−α)) log ∫ p (q/‖q‖)^{α−1} dμ − (α/(1−α)) log ‖p‖   (5)
= (α/(1−α)) log ∫ p q^{α−1} dμ + (1/(α−1)) log ∫ p^α dμ + log ∫ q^α dμ.   (6)
When X is discrete (with μ being the counting measure on X), the probability measures may be viewed as finite or countably infinite dimensional vectors. In this case, we may write
I_α(P,Q) = (α/(1−α)) log Σ_x p(x) q(x)^{α−1} + (1/(α−1)) log Σ_x p(x)^α + log Σ_x q(x)^α   (7)
= (α/(1−α)) log Σ_x (p(x)/‖p‖) (q(x)/‖q‖)^{α−1}.   (8)
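The discrete formula above can be cross-checked against the escort-measure relation stated earlier (I_α equals the Rényi divergence of order 1/α between the escorts). The following sketch is our own numerical check on a toy alphabet; the pmfs and tolerances are arbitrary choices.

```python
import math

def escort(p, alpha):
    """Escort pmf: p(x)^alpha / sum_y p(y)^alpha."""
    w = [x ** alpha for x in p]
    z = sum(w)
    return [x / z for x in w]

def renyi_div(p, q, beta):
    """Rényi divergence D_beta(P || Q) of order beta != 1, in nats."""
    return math.log(sum(px ** beta * qx ** (1 - beta)
                        for px, qx in zip(p, q))) / (beta - 1)

def I_alpha(p, q, alpha):
    """Relative alpha-entropy via the discrete sum form (strictly positive pmfs)."""
    a = alpha
    return ((a / (1 - a)) * math.log(sum(px * qx ** (a - 1) for px, qx in zip(p, q)))
            + (1 / (a - 1)) * math.log(sum(px ** a for px in p))
            + math.log(sum(qx ** a for qx in q)))

p = [0.7, 0.2, 0.1]
q = [0.3, 0.3, 0.4]
for a in [0.5, 2.0, 3.0]:
    # I_alpha(P, Q) = D_{1/alpha}(escort(P) || escort(Q)).
    assert abs(I_alpha(p, q, a)
               - renyi_div(escort(p, a), escort(q, a), 1 / a)) < 1e-10
```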
We now summarize some properties of relative α-entropy.
Lemma 2
The following properties hold.

(Positivity). I_α(P,Q) ≥ 0, with equality if and only if P = Q.

(Generalization of relative entropy). Let p ∈ L^{α₀}(μ) for some α₀ > 1 and simultaneously q ∈ L^{α₁}(μ) for some α₁ > 1. Then I_α(P,Q) is well-defined for all α sufficiently close to 1, and
lim_{α→1} I_α(P,Q) = D(P‖Q),
where D(P‖Q) is the relative entropy of P with respect to Q.

(Relation to Rényi divergence).
I_α(P,Q) = D_{1/α}(P̃‖Q̃),
where
D_β(P‖Q) = (1/(β−1)) log ∫ p^β q^{1−β} dμ
is the Rényi divergence of order β.

(Relation to Rényi entropy). Let |X| = n < ∞ and let U be the uniform probability measure on X. Then
I_α(P,U) = log n − H_α(P).

(Rényi entropy maximizer under a covariance constraint). Let X = ℝ^d and let μ be the Lebesgue measure on ℝ^d. For α and d, define the constant b (depending on α and d). With Σ a positive definite covariance matrix, the function
Z_Σ(x) = C ( 1 − sgn(α−1) · b x^T Σ^{−1} x )_+^{1/(α−1)},   x ∈ ℝ^d,
with (u)_+ := max{u, 0} and C the normalization constant, is the density function of a probability measure on ℝ^d whose covariance matrix is Σ. Furthermore, if Z is the density function of any other random vector with covariance matrix Σ, then
H_α(Z) ≤ H_α(Z_Σ).   (9)
Consequently Z_Σ is the density function of the Rényi entropy maximizer among all ℝ^d-valued random vectors with covariance matrix Σ.
Proof:
See Appendix A. \qed
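The finite-alphabet relation to Rényi entropy in Lemma 2 can be verified numerically. The sketch below is our own illustration on an arbitrary four-point pmf; it checks I_α(P, U) = log n − H_α(P) and positivity across several values of α.

```python
import math

def renyi_entropy(p, alpha):
    """Rényi entropy of order alpha != 1 (nats), counting reference measure."""
    return math.log(sum(x ** alpha for x in p)) / (1 - alpha)

def I_alpha(p, q, alpha):
    """Relative alpha-entropy via the discrete sum form (positive pmfs)."""
    a = alpha
    return ((a / (1 - a)) * math.log(sum(px * qx ** (a - 1) for px, qx in zip(p, q)))
            + (1 / (a - 1)) * math.log(sum(px ** a for px in p))
            + math.log(sum(qx ** a for qx in q)))

p = [0.5, 0.2, 0.2, 0.1]
n = len(p)
u = [1 / n] * n
for a in [0.3, 0.5, 2.0, 4.0]:
    # Relation to Rényi entropy: I_alpha(P, U) = log n - H_alpha(P).
    assert abs(I_alpha(p, u, a) - (math.log(n) - renyi_entropy(p, a))) < 1e-10
    # Positivity.
    assert I_alpha(p, u, a) >= -1e-12
```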
Remark 1
Remark 2
While the numerical value of relative entropy does not depend on the dominating measure μ, recall that I_α does depend on μ in general.
Analogous to the property that D(·‖Q) is lower semicontinuous in the topology on the space of probability measures arising from the total variation metric [29, Sec. 2.4, Assertion 5], we have the following.
Proposition 3 (Lower semicontinuity in the first argument)
For a fixed Q, consider I_α(·, Q) as a function on the set of probability measures whose densities are in L^α(μ). This function is continuous for α > 1 and lower semicontinuous for α < 1.
Proof:
See Appendix B. \qed
Remark 3
When α < 1, I_α(·, Q) is lower semicontinuous, but not necessarily continuous. To see this, let X be finite. Let P, P₁, P₂, … be probability measures on X such that all P_n’s have full support, i.e., P_n(x) > 0 for all x ∈ X, but Q(x₀) = 0 = P(x₀) for some x₀ ∈ X, and finally P_n → P. Then I_α(P_n, Q) = +∞ for all n, but I_α(P, Q) < ∞.
Remark 4
If however X is finite and Q has full support, then I_α(·, Q) is indeed continuous, and this can be seen by taking the limit term by term in (7).
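The failure of continuity described in Remark 3 can be exhibited concretely. The sketch below is our own minimal two-point instance of that construction (the specific pmfs are our choice): the divergence is +∞ along the whole approximating sequence, yet finite at the limit.

```python
import math

def I_alpha(p, q, alpha):
    """I_alpha on a finite alphabet with the boundary conventions:
    terms with p(x) = 0 contribute 0, and for alpha < 1 any point with
    p(x) > 0 = q(x) makes the divergence infinite."""
    a = alpha
    s = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue
        if qx == 0:
            if a < 1:
                return math.inf
            continue
        s += px * qx ** (a - 1)
    return ((a / (1 - a)) * math.log(s)
            + (1 / (a - 1)) * math.log(sum(px ** a for px in p if px > 0))
            + math.log(sum(qx ** a for qx in q if qx > 0)))

a = 0.5
q = [1.0, 0.0]
# P_n has full support and converges to the limit P = (1, 0) ...
for n in [10, 100, 1000]:
    p_n = [1 - 1 / n, 1 / n]
    assert I_alpha(p_n, q, a) == math.inf
# ... yet the value at the limit is finite (here zero): along this sequence
# the map P -> I_alpha(P, Q) is lower semicontinuous but not continuous.
assert I_alpha([1.0, 0.0], q, a) == 0.0
```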
We now address the behavior of I_α as a function of its second argument.
Proposition 4
Fix α > 0, α ≠ 1. For a fixed P, the mapping Q ↦ I_α(P, Q) is lower semicontinuous in L^α(μ).
Proof:
See Appendix C. \qed
Remark 5
When X is finite, with +∞ as a potential limiting value, Q ↦ I_α(P, Q) is continuous for all α > 0, α ≠ 1, as is easily seen by taking termwise limits in the summation in (7).
We next establish quasiconvexity of I_α in the first argument, i.e., for every fixed Q and real number c, the lower level sets (or “balls”) {P : I_α(P, Q) ≤ c} are convex.
Proposition 5
Fix α > 0, α ≠ 1. For a fixed Q, the mapping P ↦ I_α(P, Q) is quasiconvex.
Proof:
See Appendix D. \qed
Remark 6
In general, for both α < 1 and α > 1, I_α is not convex in either of its arguments. Moreover, I_α does not satisfy the data processing inequality, while relative entropy, and more generally Csiszár’s f-divergences, do.
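Quasiconvexity in the first argument (Proposition 5) can be probed numerically. The following sketch is our own illustration on an arbitrary three-point example: along the segment between two pmfs, the divergence from a fixed Q never exceeds the larger of the endpoint values.

```python
import math

def I_alpha(p, q, alpha):
    """Relative alpha-entropy via the discrete sum form (positive pmfs)."""
    a = alpha
    return ((a / (1 - a)) * math.log(sum(px * qx ** (a - 1) for px, qx in zip(p, q)))
            + (1 / (a - 1)) * math.log(sum(px ** a for px in p))
            + math.log(sum(qx ** a for qx in q)))

q = [0.2, 0.3, 0.5]
p0 = [0.7, 0.2, 0.1]
p1 = [0.1, 0.1, 0.8]
for a in [0.5, 2.0]:
    cap = max(I_alpha(p0, q, a), I_alpha(p1, q, a))
    for k in range(1, 100):
        lam = k / 100
        mix = [lam * x + (1 - lam) * y for x, y in zip(p0, p1)]
        # Quasiconvexity: the value on the segment never exceeds the
        # larger of the two endpoint values.
        assert I_alpha(mix, q, a) <= cap + 1e-12
```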
III Existence and Uniqueness of the Forward Projection
In this section, we shall introduce the notion of a forward projection of a probability measure on a subset of probability measures. We shall also prove a sufficiency result for the existence of the forward projection. We begin by first proving a useful inequality relating I_α divergences. This inequality turns out to be the analog of the parallelogram identity of [7] for relative entropy (α = 1) and the analog of the Apollonius theorem in plane geometry (see, e.g., Bhatia [30, p. 85]). While these analogs are equalities, our generalization comes at the cost of a weakening of the equality to an inequality.
Proposition 6 (Extension of Apollonius Theorem)
Let α < 1. Let P, P₀, P₁ be probability measures that are absolutely continuous with respect to μ, and let the corresponding Radon–Nikodym derivatives p, p₀, and p₁ be in L^α(μ). Assume I_α(P₀, P) < ∞ and I_α(P₁, P) < ∞. We then have
(10) 
where
(11) 
When α > 1, the reversed inequality holds in (10).
Proof:
See Figure 1 for an interpretation of (10) as an analog of the Apollonius Theorem. We first recognize that
(12) 
Let p̄ denote the density defined in (11). Using (12), the left-hand side of (10) can be expanded to
where (a) follows from (11) and after a multiplication and a division by an appropriate scalar; (b) follows from (12). The lemma would follow if we can show
for α < 1, and the reversed inequality for α > 1. But these are direct consequences of Minkowski’s inequalities for α < 1 and α > 1 applied to (11). \qed
Let us now formally define what we mean by a forward projection.
Definition 7
If E is a set of probability measures on (X, 𝒳) such that I_α(P, Q) < ∞ for some P ∈ E, a measure P* ∈ E satisfying
I_α(P*, Q) = inf_{P ∈ E} I_α(P, Q)   (13)
is called a forward projection of Q on E.
For a set E of probability measures on (X, 𝒳), let
𝔼 := { dP/dμ : P ∈ E }
be the corresponding set of densities. We shall assume that 𝔼 ⊂ L^α(μ).
We are now ready to state our first main result on the existence and uniqueness of the forward projection.
Theorem 8 (Existence and uniqueness of the forward projection)
Fix α > 0, α ≠ 1. Let E be a set of probability measures whose corresponding set of density functions 𝔼 is convex and closed in L^α(μ). Let Q be a probability measure (with density q ∈ L^α(μ)) and suppose that I_α(P, Q) < ∞ for some P ∈ E. Then Q has a unique forward projection on E.
Remark 7
This is a generalization of Csiszár’s projection result [7, Th. 2.1] for relative entropy (α = 1). The analog of “𝔼 is closed in L^α(μ)” for relative entropy is closure in the topology arising from the total variation metric, one of the hypotheses in [7, Th. 2.1]. The proof ideas are different for the two cases α < 1 and α > 1. The proof for α < 1 is a modification of Csiszár’s approach in [7], and is similar to the classical proof of existence and uniqueness of the best approximant of a point (in a Hilbert space) from a given closed and convex set of the Hilbert space. (See, e.g., [30, Ch. 11, Th. 14].) The proof for α > 1 exploits the reflexivity of the Banach space L^α(μ). This alternative approach is required because the inequality in the extension of the Apollonius theorem (Proposition 6) is in a direction that renders the classical approach inapplicable. We are indebted to Pietro Majer for suggesting some key steps for the case α > 1 on the mathoverflow.net forum.
Remark 8
In general, when α ≠ 1, the forward projection depends on the reference measure μ. The case of relative entropy (α = 1) is however special in that the forward projection does not depend on the reference measure μ.
Remark 9
The above result was established by Sundaresan [15, Prop. 23] for finite X. That proof relied on the compactness of the probability simplex on X. The current proof works for general measure spaces.
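Before turning to the proof, a forward projection can be computed numerically in a tiny finite-alphabet instance. The sketch below is entirely our own illustration (alphabet, constraint, α, and search method are arbitrary choices, not from the paper): the constraint set is a linear family, its densities form a segment in the simplex, and the quasiconvexity of P ↦ I_α(P, Q) (Proposition 5) justifies locating the minimizer by ternary search.

```python
import math

def I_alpha(p, q, alpha):
    """Relative alpha-entropy via the discrete sum form (q strictly positive)."""
    a = alpha
    return ((a / (1 - a)) * math.log(sum(px * qx ** (a - 1) for px, qx in zip(p, q)))
            + (1 / (a - 1)) * math.log(sum(px ** a for px in p))
            + math.log(sum(qx ** a for qx in q)))

# Linear family on {0, 1, 2}: pmfs with mean 0.5, parametrized by t = p(2)
# as p = (0.5 + t, 0.5 - 2t, t) for t in (0, 0.25).
def member(t):
    return [0.5 + t, 0.5 - 2 * t, t]

q = [1 / 3, 1 / 3, 1 / 3]
alpha = 0.7

# Ternary search along the segment, justified by quasiconvexity.
lo, hi = 1e-9, 0.25 - 1e-9
for _ in range(200):
    m1 = lo + (hi - lo) / 3
    m2 = hi - (hi - lo) / 3
    if I_alpha(member(m1), q, alpha) < I_alpha(member(m2), q, alpha):
        hi = m2
    else:
        lo = m1
t_star = (lo + hi) / 2
p_star = member(t_star)

# The numerical projection is a valid pmf satisfying the linear constraint,
# and it beats a fine grid of feasible points.
assert abs(sum(p_star) - 1) < 1e-9
assert abs(sum(i * x for i, x in enumerate(p_star)) - 0.5) < 1e-9
grid_min = min(I_alpha(member(0.001 + k * 0.248 / 500), q, alpha) for k in range(501))
assert I_alpha(p_star, q, alpha) <= grid_min + 1e-6
```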
Proof:
(a) We first consider the case α < 1.
Existence of forward projection: Pick a sequence (P_n) in E such that I_α(P_n, Q) < ∞ for all n and
lim_{n→∞} I_α(P_n, Q) = inf_{P ∈ E} I_α(P, Q).   (14)
Apply Proposition 6 with to get
(15) 
where
on account of the convexity of . Using and then rearranging (15), we get
(16)  
(17) 
Now let m, n → ∞. We claim the expression on the right-hand side of (17) must approach 0. Indeed, that the liminf of the right-hand side of (17) is at least 0 is clear from the inequalities (16) and (17). But the limsup is at most 0 because both I_α(P_n, Q) and I_α(P_m, Q) approach the infimum value, and the divergence of the midpoint measure from Q is at least this infimum value for each n and m. This establishes the claim.
Consequently, the right-hand side of (16) converges to 0. Using this and the nonnegativity of I_α, we get
(18) 
From [31, Th. 1], a generalization of Pinsker’s inequality for the I_α divergence when α < 1, and with V(P, P′) denoting the total variation distance between probability measures P and P′, we have
The triangle inequality for the total variation metric then yields
as m, n → ∞, i.e., the sequence (P_n) is a Cauchy sequence in the total variation metric. It must therefore converge to some limit in this metric, i.e.,
(19) 
It follows that , and since for all , we must have .
From the convergence in (19), we also have in measure.
We will now demonstrate that the probability measure with density proportional to is in and is a forward projection, thereby establishing existence.
In view of the convergence in measure and the upper bound
we can apply the generalized version of the dominated convergence theorem ([32, Ch. 2, Ex. 20] or [33, p.139, Problem 19]) to get
We next claim that
(20) 
Suppose not; then working on a subsequence if needed, we have . As , given any ,
and hence in measure, or except on a set of measure 0 (i.e., a.e.) . But this is a contradiction since . Thus (20) holds, and we can pick a subsequence of the sequence that converges to some . Reindex and work on this subsequence to get in .
It is now that we use the hypothesis that is closed in . We remind the reader that is the set of densities of members of . The closedness implies that the limiting function for some , and so must be the density of a probability measure, say . Since we also have , it follows that and . As in , lower semicontinuity of (Proposition 3) implies
(21) 
Since P* ∈ E, I_α(P*, Q) is at least the infimum in (13), and therefore equality must hold in (21), and P* is a forward projection of Q on E.
Uniqueness: Our proof of uniqueness is analogous to the usual proof of uniqueness of projection in Hilbert spaces [30, p. 86]. A simpler proof, after the ‘Pythagorean property’ is established, can be found at the end of Section IV.
Write I* for the infimum value on the right-hand side of (14) and let P₀ and P₁ both attain the infimum. Apply Proposition 6 with Q in place of P, and with the minimizers P₀ and P₁, to get
(22) 
where
Since we have . Use this in (22), substitute , and we get
and this implies
The nonnegativity of each of the terms then implies that each must be zero, and so P₀ = P₁. The forward projection is unique.
This completes the proof for the case when α < 1.
(b) We now consider the case when α > 1.
Existence of forward projection: Equation (13) can be rewritten (using (5)) as
(23)  
(24) 
where
and , an element of the dual space . Allowing makes convex (as we shall soon show), but does not change the supremum.
We now claim that
(25) 
Assume the claim. Since L^α(μ) is a reflexive Banach space for α > 1, the convex and closed set in question is also closed in the weak topology [34, Ch. 10, Cor. 23]. Using the Banach–Alaoglu theorem and the fact that L^α(μ) is a reflexive Banach space, we have that the unit ball is compact in the weak topology. Since the set is a (weakly) closed subset of a (weakly) compact set, it is (weakly) compact. The linear functional is continuous in the weak topology, and hence the supremum over the (weakly) compact set is attained. Since the linear functional increases with the norm, the supremum is attained at norm 1, i.e., there exists an element at which the supremum in (23) is attained.
We now proceed to show the claim (25). To see convexity, let , let , and let . The convex combination of and is
If both and are zero, then this convex combination is 0 which is trivially in . Otherwise, we can write the convex combination as
(26) 
where
(27)  
(28) 
To show that the convex combination is in , it suffices to show that and .
The convexity of immediately implies that . It is also clear that . From Minkowski’s inequality (for α > 1), we have
(29)  
This establishes that is convex.
To see that is closed in , let be a sequence in such that for some . We need to show .
Write , where and . Since in , take norms to get , and so .
If , then a.e., and so trivially belongs to . We may therefore assume . It follows that in .
Again, as in (20), we claim that is bounded. Suppose not. As in the proof of (20), move to a subsequence if needed and assume . As , we have
as , and in measure, or its limit a.e.. But this contradicts the fact that . Thus is bounded.
Focusing on a subsequence, if needed, we may assume