# Optimal measures and Markov transition kernels

This work was supported by EPSRC grant EP/H031936/1.

## Abstract

We study optimal solutions to an abstract optimization problem for measures, which generalizes classical variational problems of information theory and statistical physics. In the classical problems, information and relative entropy are defined using the Kullback-Leibler divergence, and for this reason optimal measures belong to a one-parameter exponential family. Measures within such a family have the property of mutual absolute continuity. Here we show that this property characterizes other families of optimal positive measures whenever a functional representing information has a strictly convex dual. Mutual absolute continuity of optimal probability measures allows us to strictly separate deterministic and non-deterministic Markov transition kernels, which play an important role in theories of decisions, estimation, control, communication and computation. We show that deterministic transitions are strictly sub-optimal, unless the information resource with a strictly convex dual is unconstrained. For illustration, we construct an example where, unlike a non-deterministic kernel, any deterministic kernel either has negatively infinite expected utility (unbounded expected error) or communicates infinite information.

## 1 Introduction

This work was motivated by the fact that probability measures within an exponential family, which are solutions to variational problems of information theory and statistical physics, are mutually absolutely continuous. Thus, we begin by clarifying and discussing this property in the simplest setting. Let Ω be a finite set, and let x : Ω → ℝ be a real function. Consider the family of real functions yβ, indexed by β ∈ ℝ:

 yβ(ω) = e^{βx(ω)} y₀(ω),  y₀(ω) ≥ 0   (1)

The elements yβ represent one-parameter exponential measures on Ω, and normalized elements are the corresponding exponential probability measures. Of course, exponential measures can be defined on an infinite set, for example, as elements of the Banach space Y of real Radon measures on a locally compact space Ω [11]. In this case, the functions defining the family are associated with the normed algebra X of continuous functions with compact support in Ω. As will be clarified later, Y can be considered not only as the dual of X, but also as a module over the algebra X, which explains the definition of an exponential family (1) as multiplication of y₀ by elements of X. Furthermore, for some β, exponential measures are finite even if the function x is not continuous, has non-compact support and is unbounded. A similar construction can be made in the case when X is a non-commutative ∗-algebra, such as the algebra of compact Hermitian operators on a separable Hilbert space used in quantum probability theory. However, quantum exponential measures can be defined in several ways, differing by the ordering of the non-commuting factors, and these definitions are not equivalent.

One property that characterizes all these exponential measures is that elements within a family are mutually absolutely continuous. We recall that a measure y is absolutely continuous with respect to a measure μ if μ(E) = 0 implies y(E) = 0 for all E in the σ-ring of subsets of Ω. Mutual absolute continuity is the case when the implication holds in both directions. It is easy to see from equation (1) that exponential measures within one family have exactly the same support and are mutually absolutely continuous. This property is particularly important when measures are considered on a composite system, such as a direct product A × B of two sets. Normalized measures on such a product are joint probability measures uniquely defining conditional probabilities (i.e. Markov transition kernels). Observe now that if a joint measure and the product of its marginals are mutually absolutely continuous, then the conditional probability of b given a is positive for every b whose marginal probability is positive. A conditional probability with this property is non-deterministic, because several elements b can be in the ‘image’ of a. Clearly, all joint probability measures within an exponential family define such non-deterministic transition kernels.
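The common-support property of family (1) can be checked directly in a small numeric sketch; the utility x and the reference measure y₀ below are illustrative assumptions (y₀ is chosen to vanish at one point, so the support is a proper subset of Ω):

```python
import math

# A finite set Omega of 4 points, a utility x and a reference measure y0
# (both chosen arbitrarily for illustration); y0 vanishes at the third point.
x  = [1.0, -0.5, 2.0, 0.3]
y0 = [0.2,  0.5, 0.0, 0.3]

def exp_member(beta):
    # y_beta(w) = e^{beta x(w)} y0(w), equation (1)
    return [math.exp(beta * xi) * yi for xi, yi in zip(x, y0)]

# Every member of the family has exactly the support of y0, so any two
# members are mutually absolutely continuous.
supports = {tuple(v > 0 for v in exp_member(b)) for b in (-2.0, 0.0, 1.5)}
assert len(supports) == 1
```

Since e^{βx(ω)} is strictly positive, multiplying y₀ by it can never create or destroy support points, which is exactly the geometric content of mutual absolute continuity within the family.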

Another, perhaps the most important, property of exponential families is that they are, in a certain sense, optimal. It is well-known in mathematical statistics that the lower bound for the variance of an unbiased estimator of an unknown parameter, defined by the Rao-Cramer inequality, is attained if and only if the probability distribution is a member of an exponential family [13, 31]. In statistical physics, it is known that exponential distributions (i.e. Boltzmann or Gibbs distributions) maximize the entropy of a thermodynamical system under a constraint on energy [17]. In information theory, exponential transition kernels are known to maximize channel capacity [33, 34, 35], and they are used in some randomized optimization techniques (e.g. [20]) as well as various machine learning algorithms [39]. A one-parameter exponential family has been studied in information geometry, where it was shown to be a Banach space with an Orlicz norm [30]. Similar constructions have been considered in quantum probability [10, 36].

Optimality of exponential families of measures on the one hand and their mutual absolute continuity on the other is a particularly interesting combination, because it appears that for the first time we have an optimality criterion with respect to which all deterministic transitions between elements of a composite system are strictly sub-optimal. This is important not only for information and communication theories, but also for theories of computational and algorithmic complexity, because Markov transition kernels can be used to represent various input-output systems, including computational systems and algorithms. Thus, understanding the relation between mutual absolute continuity within some families of measures and their optimality was the main motivation for this work.

It is well-known, and will be recalled later in this paper, that a one-parameter exponential family of probability measures is the solution to the variational problem of minimizing the Kullback-Leibler (KL) divergence [23] of one probability measure from another subject to a constraint on an expected value. In fact, the logarithmic function, which appears in the definition of the KL-divergence, is precisely the reason why the exponential function appears in the solutions. However, mutual absolute continuity, which for composite systems implies the non-deterministic property of conditional probabilities, is not exclusive to families of exponential measures. Indeed, geometrically, this property simply means that measures are in the interior of the same positive cone, defined by their common support. Thus, our method is based on a generalization of the above-mentioned variational problem, relaxing the definition of information, followed by a geometric analysis of its solutions.

In the next section, we introduce the notation, define the generalized optimization problem and recall some basic relevant facts. An abstract information resource will be represented by a closed functional F, defined on the space of measures, whose values can be associated with values of some information distance (e.g. the KL-divergence). In Section 3 we establish several properties of optimal solutions. In particular, we prove in Proposition 3 that the optimal value function is an order isomorphism putting information in duality with expected utility of an optimal system. These results are then used in Section 4 to prove a theorem relating mutual absolute continuity of optimal positive measures to strict convexity of the functional F∗, the Legendre-Fenchel dual of the functional F representing an information resource. We show that strict convexity of F∗ is necessary to separate different variational problems by optimal measures, and for this reason it appears to be a natural minimal requirement on information, generalizing the additivity axiom. Because the proof of mutual absolute continuity does not depend on commutativity of the algebra X, pre-dual of Y, these results apply to the general, non-commutative setting used in quantum probability and information theories. In Section 5, we discuss optimal Markov transition kernels (conditional probabilities) in the classical (commutative) setting, which is done for reasons of simplicity. We recall several facts about transition kernels, the information capacity of memoryless channels they represent and the corresponding variational problems. The main result of this section is a theorem separating deterministic and non-deterministic kernels. We show how mutual absolute continuity of optimal Markov transition kernels implies that optimal transitions are non-deterministic; deterministic transitions are strictly sub-optimal if information, understood broadly here, is constrained.
This result is illustrated by an example, where any deterministic kernel either has a negatively infinite expected utility (unbounded expected error) or communicates infinite information; a non-deterministic kernel, on the other hand, can have both finite expected utility and finite information. At the end of the section we consider applications of this work to theories of algorithms and computational complexity. We discuss how deterministic and non-deterministic algorithms can be represented by Markov transition kernels between the space of inputs and the space of output sequences, and how constraints on the expected utility or complexity of the algorithms are related to the variational problems studied in this work. The paper concludes with a summary and discussion of the results.

## 2 Preliminaries

This work is based on a generalization of classical variational problems of information theory and statistical physics, which can be formulated as follows. Let Ω be a measurable set and let P(Ω) be the set of all Radon probability measures on Ω. We denote by E_p{x} the expected value of a random variable x with respect to a probability measure p. An information distance is a function of two measures that is closed (lower semicontinuous) in each argument. An important example is the Kullback-Leibler divergence E_p{ln(p/q)} [23]. We remind that E_p{x} is linear in p, and E_p{ln(p/q)} is convex. The variational problem is formulated as follows:

 maximize (minimize)  E_p{x}   subject to  E_p{ln(p/q)} ≤ λ   (2)

where optimization is over probability measures p ∈ P(Ω). This problem can be considered as linear programming with an infinite number of linear constraints, and it can be reformulated as the following convex programming problem:

 minimize  E_p{ln(p/q)}   subject to  E_p{x} ≥ υ  (E_p{x} ≤ υ)   (3)

Figure 1 illustrates these variational problems on a 2-simplex of probability measures over a set of three elements with the uniform distribution as the reference measure.
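Problem (2) can be solved numerically on such a three-point set: the maximizer lies on an exponential family p_β ∝ q e^{βx}, and since the KL-divergence from q grows monotonically with β ≥ 0, bisection finds the β at which the constraint becomes active. The utility values and constraint level below are illustrative assumptions:

```python
import math

# Maximize E_p{x} subject to KL(p, q) <= lam on three outcomes.
x = [1.0, 0.0, -1.0]
q = [1/3, 1/3, 1/3]   # uniform reference measure
lam = 0.1             # information constraint (nats)

def p_beta(beta):
    # Member of the exponential family: p_beta proportional to q * e^{beta x}
    w = [qi * math.exp(beta * xi) for qi, xi in zip(q, x)]
    z = sum(w)
    return [wi / z for wi in w]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# KL(p_beta, q) increases with beta >= 0, so bisect to make it equal lam.
lo, hi = 0.0, 50.0
for _ in range(100):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if kl(p_beta(mid), q) < lam else (lo, mid)
p = p_beta(lo)
assert abs(kl(p, q) - lam) < 1e-9
```

At the solution the constraint holds with equality, and the achieved expected utility `sum(pi * xi for pi, xi in zip(p, x))` strictly exceeds the unconstrained reference value E_q{x} = 0, consistent with the optimal value function being strictly increasing in λ.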

In optimization and information theories, E_p{x} represents expected utility to be maximized or expected cost to be minimized. In physics, it represents internal energy. The information distance E_p{ln(p/q)} is also called relative entropy, and the inequality E_p{ln(p/q)} ≤ λ represents an information constraint. Depending on the domain of definition of the probability measures, the information constraint may have different meanings, such as a lower bound on entropy (i.e. irreducible uncertainty), partial observability of a random variable, a constraint on the amount of statistical information (i.e. a number of independent tests, questions or bits of information), on the communication capacity of a channel, on the memory of a computational device and so on [35]. These variational problems can also be formulated in quantum physics, where x is an element of a non-commutative algebra of observables, and p, q are quantum probabilities (states).

As is well-known, solutions to problems (2) and (3) are elements of an exponential family of probability distributions. Before we define an appropriate generalization of these problems, we recall some axiomatic principles underpinning the choice of the functionals.

### 2.1 Axioms behind the choice of functionals

The choice of a linear objective functional has an axiomatic foundation in game theory [27], where Ω is equipped with a total pre-order ≲, called the preference relation, and the function x : Ω → ℝ is its utility representation: ω₁ ≲ ω₂ if and only if x(ω₁) ≤ x(ω₂). Because the quotient set of a pre-ordered set with a utility function is isomorphic to a subset of the real line, it is separable and metrizable by |x(ω₁) − x(ω₂)|, and therefore every probability measure on the completion of Ω is Radon (e.g. by Ulam’s theorem for probability measures on Polish spaces).

The set P(Ω) of all classical probability measures on Ω is a simplex with the Dirac measures comprising the set of its extreme points [29]. The question that has been discussed extensively is: how to extend the pre-order ≲, which was defined on Ω, to the whole of P(Ω)? It was shown in [27] that a linear (or affine) functional E_p{x} is the only functional that makes the extended pre-order compatible with the vector space structure of the space of measures and Archimedean. We remind that for the corresponding pre-order ≲ this is defined by the axioms:

1. p ≲ q implies αp + (1 − α)r ≲ αq + (1 − α)r for all r ∈ P(Ω) and α ∈ (0, 1].

2. αp + (1 − α)r ≲ q for all α ∈ (0, 1) implies p ≲ q.

In this paper we shall follow this formalism, assuming that the objective functional E_p{x} is linear. We note that non-linearity may arise in certain dynamical systems, where the utility x may change with time, but this will not be considered in this work, because our focus is on optimization problems with respect to some fixed preference relation or utility on Ω. A non-commutative (quantum) analogue of a utility function was given in [7] by a Hermitian operator on a separable Hilbert space (an observable), with its real spectrum representing a total pre-order on its eigenstates. The principal difference with the classical theory is the existence of incompatible (non-commuting) utility operators.

As mentioned earlier, information constraints may be related to different phenomena (e.g. uncertainty, observability, statistical data, communication capacity, memory, etc.). However, in information theory they have often been represented by functionals, such as relative entropy or Shannon information, which are defined using the Kullback-Leibler divergence E_p{ln(p/q)}. Its choice is also based on a number of axioms [14, 19, 33], such as additivity: the divergence between product measures is the sum of the divergences between the corresponding factors. In fact, this axiom is precisely the reason why the logarithm function appears in its definition (i.e. as a homomorphism between the multiplicative and additive groups of the reals). There is, however, an abundance of other information distances and metrics, such as the Hellinger distance, the total variation and the Fisher metrics. Although they often fail to have a proper statistical interpretation [12], there has been a renewed interest in using different information distances and contrast functions in applications to compare distributions (e.g. see [4, 6, 26]).
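The additivity axiom is easy to verify numerically; the distributions below are arbitrary choices for illustration:

```python
import math
from itertools import product

# Check additivity of the KL-divergence over independent products:
# KL(p1 x p2, q1 x q2) = KL(p1, q1) + KL(p2, q2).
def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p1, q1 = [0.7, 0.3], [0.5, 0.5]
p2, q2 = [0.2, 0.5, 0.3], [0.4, 0.4, 0.2]

p12 = [a * b for a, b in product(p1, p2)]   # joint of the independent pair
q12 = [a * b for a, b in product(q1, q2)]

assert abs(kl(p12, q12) - (kl(p1, q1) + kl(p2, q2))) < 1e-12
```

The identity holds because ln(p₁p₂/q₁q₂) = ln(p₁/q₁) + ln(p₂/q₂): the logarithm converts the multiplicative structure of product measures into the additive structure of information, which is the homomorphism mentioned above.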

For the reasons outlined above, we shall generalize problems (2) and (3) by considering an abstract information distance or resource, which will be used to define a subset of feasible solutions. In addition, we shall not restrict the problems to normalized measures, which makes the exposition considerably simpler; normalization can be performed at a later stage. We now define an appropriate algebraic structure.

### 2.2 Dual algebraic structures

Let X and Y be complex linear spaces put in duality via a bilinear form ⟨·, ·⟩ : X × Y → ℂ:

 ⟨x, y⟩ = 0 ∀x ∈ X ⇒ y = 0,   ⟨x, y⟩ = 0 ∀y ∈ Y ⇒ x = 0

We denote by X∗ the algebraic dual of X, by X′ the continuous dual of a locally convex space X and by (X, ‖·‖)′ the complete normed dual of a normed space X. The same notation applies to the dual spaces of Y. The results will be derived using only the facts that X and Y are ordered linear spaces in duality. These spaces, however, can have richer algebraic structures, which we briefly outline here.

The space X is closed under an associative, but generally non-commutative, binary operation (e.g. pointwise multiplication or matrix multiplication) and an involution x ↦ x∗, a self-inverse, antilinear map reversing the multiplication order: (xy)∗ = y∗x∗. Thus, X is a ∗-algebra. The set of all Hermitian elements x = x∗ is a real subspace of X, and if every element x∗x has positive real spectrum, then X is called a total ∗-algebra, in which the spectrum of all Hermitian elements is real. In this case, Hermitian elements with positive spectrum form a pointed convex cone X₊, generating X.

The dual space Y is closed under the transposed involution y ↦ y∗, defined by ⟨x, y∗⟩ = ⟨x∗, y⟩∗. It is ordered by a positive cone Y₊, dual of X₊, and it has an order unit (also called a reference measure), which is a strictly positive linear functional on X₊. If the pairing has the property that for each pair of elements there exists a transposed element giving the same pairing values, then Y is a left (right) module over X with respect to the transposed left (right) action of X on Y (see [9], Appendix). In many practical cases, the pairing is central (or tracial), so that the left and right transpositions act identically on X; in this case, the transposed element can be identified with a complex conjugation.

Two primary examples of a total ∗-algebra X, which are important in this work, are the commutative algebra of continuous functions with compact support in a locally compact topological space Ω and the non-commutative algebra of compact Hermitian operators on a separable Hilbert space. The corresponding examples of the dual space Y are the Banach space of complex signed Radon measures on Ω and its non-commutative generalization. Note that these examples of the algebra X are generally incomplete and contain only an approximate identity. However, by X we shall understand here an extended algebra that contains additional elements. In particular, X will contain the unit element 1 such that ⟨1, y⟩ = ‖y‖ for y ≥ 0 (i.e. ⟨1, ·⟩ coincides on the positive cone with the norm ‖·‖, which is additive on it). Furthermore, because constraints in variational problems (2) or (3), or their generalizations, define a proper subset of the space Y, we can consider random variables represented by elements x that are outside of the Banach space (e.g. unbounded functions or operators).

Below are the three main examples of the pairing of X and Y by a sum, an integral or a trace:

 ⟨x, y⟩ := ∑_Ω x(ω) y(ω),   ⟨x, y⟩ := ∫_Ω x(ω) dy(ω),   ⟨x, y⟩ := tr{xy}   (4)

Although the linear functionals are generally complex-valued, we shall assume, without further mentioning, that ⟨x, y⟩ is evaluated on Hermitian elements x = x∗ and y = y∗, so that ⟨x, y⟩ ∈ ℝ. In particular, the expected value E_y{x} := ⟨x, y⟩ is real, where x is Hermitian and y is positive. Thus, the expressions ‘maximize (minimize) ⟨x, y⟩’ should be understood accordingly as maximization or minimization of a real functional.
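The discrete and trace pairings of (4) can be sketched side by side; the vectors and matrices below are arbitrary illustrations, and the commutative (diagonal) case shows how the trace pairing reduces to the sum pairing:

```python
# Sum pairing <x, y> over a finite set, and trace pairing tr{xy} over matrices.
def pair_sum(x, y):
    return sum(a * b for a, b in zip(x, y))

def pair_trace(X, Y):
    n = len(X)
    # tr{XY} = sum_{i,j} X[i][j] * Y[j][i]
    return sum(X[i][j] * Y[j][i] for i in range(n) for j in range(n))

x = [2.0, -1.0, 0.5]   # a "random variable" on 3 points
y = [0.3, 0.4, 0.3]    # a probability measure on 3 points
# Embed both as diagonal (hence commuting, Hermitian) operators:
Xd = [[x[i] if i == j else 0.0 for j in range(3)] for i in range(3)]
Yd = [[y[i] if i == j else 0.0 for j in range(3)] for i in range(3)]

# In the commutative case the trace pairing agrees with the sum pairing.
assert abs(pair_trace(Xd, Yd) - pair_sum(x, y)) < 1e-12
```

This embedding of the commutative pairing into the trace pairing is the simplest instance of the classical theory sitting inside the non-commutative (quantum) one.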

### 2.3 Generalized variational problems for measures

Normalized non-negative measures (i.e. probability measures) are elements of the set:

 P:={y∈Y:y≥0, ⟨1,y⟩=1}

This is a weakly compact convex set, and therefore, by the Krein-Milman theorem, it is the closed convex hull of its extreme points. In the commutative case, P is a simplex, because each y ∈ P is uniquely represented by extreme points [29]. In information geometry, P is referred to as a statistical manifold, and its topological properties have been studied by defining different information distances [3, 12, 30]. We can generalize this by considering an information resource as a functional, defined for all positive or Hermitian elements.

Let F : Y → ℝ ∪ {∞} be a closed functional, so that F is finite at some y, and the sublevel sets {y : F(y) ≤ λ} are closed in the weak topology for each λ ∈ ℝ. Because the value −∞ is not included in the definition of a closed F, it is also lower-semicontinuous [32]. We shall assume without further mentioning that the effective domain dom F := {y : F(y) < ∞} has non-empty algebraic interior. In addition, if Y is defined over the field of complex numbers, we shall also assume that dom F contains only Hermitian elements (e.g. positive measures).

Variational problems (2) and (3) are generalized by considering all, not necessarily positive or normalized, measures, and by using any closed functional F to define an information resource. The optimal values achieved by solutions to these problems are defined by the following optimal value functions:

 x̄(λ) := sup{⟨x, y⟩ : F(y) ≤ λ}   (5)

 x̲(λ) := inf{⟨x, y⟩ : F(y) ≤ λ}   (6)

 x̄⁻¹(υ) := inf{F(y) : ⟨x, y⟩ ≥ υ}   (7)

 x̲⁻¹(υ) := inf{F(y) : ⟨x, y⟩ ≤ υ}   (8)

We define x̄(λ) := −∞ if λ < inf F, and x̲(λ) := ∞ similarly. Observe that x̲(λ) = −(−x)‾(λ) and x̲⁻¹(υ) = (−x)‾⁻¹(−υ). Thus, it is sufficient to study only the properties of x̄ and x̄⁻¹. Figure 2 depicts schematically the optimal value functions x̄(λ) and x̲(λ). It is clear from the definition that x̄ is a non-decreasing extended real function, and x̲ is non-increasing. It will be shown also in the next section that x̄ is concave, and x̲ is convex (Proposition 3). Because the sets {y : F(y) ≤ λ} may be unbalanced and unbounded, the functions may not be reflections of each other in the sense that x̲(λ) = −x̄(λ), and one or both functions can be empty. The definition of the optimal value functions (5)–(8) in terms of a functional F of one variable, unlike an information distance of two variables, allows for considering the case when the infimum of F is not achieved at any measure.

In addition to x̄ and x̲, we define two special values λ̄ and λ̲ of the functional F as follows:

 x̄(λ̄) := sup{⟨x, y⟩ : y ∈ dom F},   x̲(λ̲) := inf{⟨x, y⟩ : y ∈ dom F}   (9)

Thus, problems of maximization or minimization of ⟨x, y⟩ subject to the constraints F(y) ≤ λ̄ or F(y) ≤ λ̲ respectively are equivalent to unconstrained problems on dom F. The corresponding optimal values are denoted x̄(λ̄) and x̲(λ̲), as shown on Figure 2. The reason for defining these values is that generally λ̄ and λ̲ differ (see Figure 2). Solutions to unconstrained problems may correspond to large, possibly infinite values λ̄ or λ̲, and therefore they can be considered unfeasible. Subsets of feasible solutions will be defined by constraints λ < λ̄ or λ < λ̲.

In addition, we define the following special values:

 ῡ₀ := lim_{λ ↓ inf F} sup{⟨x, y⟩ : F(y) ≤ λ},   υ̲₀ := lim_{λ ↓ inf F} inf{⟨x, y⟩ : F(y) ≤ λ}   (10)

If there exists a set of elements y₀ such that F(y₀) = inf F for all of them, then ῡ₀ = sup⟨x, y₀⟩ and υ̲₀ = inf⟨x, y₀⟩. If y₀ is unique, then ῡ₀ = υ̲₀; otherwise ῡ₀ ≥ υ̲₀ (see Figure 2). The elements y₀ represent trivial solutions, because they correspond to the constraint λ = inf F in the functions x̄(λ) and x̲(λ). The constraints υ > ῡ₀ and υ < υ̲₀ in the inverse functions x̄⁻¹(υ) and x̲⁻¹(υ) ensure that F(y) > inf F, and the solutions are non-trivial.

### 2.4 Some facts about subdifferentials of dual convex functions

In the next section, we show that solutions to the generalized variational problems with optimal values (5)–(8), if they exist, are elements of a subdifferential of the functional F∗, dual of F. We remind that F∗ is the Legendre-Fenchel transform of F:

 F∗(x) := sup{⟨x, y⟩ − F(y) : y ∈ Y}

and it is always closed and convex (e.g. see [32, 38]). The condition F = F∗∗ implies that F is closed and convex. Otherwise, the epigraph of F∗∗ is a convex closure of the epigraph of F. Closed and convex functionals are continuous on the (algebraic) interior of their effective domains (e.g. see [25] or [32], Theorem 8), and they have the property

 x ∈ ∂F(y) ⟺ y ∈ ∂F∗(x)   (11)

where the set ∂F(y) is the subdifferential of F at y, and its elements are called subgradients. In particular, x ∈ ∂F(y) implies F(z) ≥ F(y) + ⟨x, z − y⟩ for all z (i.e. F is bounded below by an affine functional touching it at y). We point out that the notions of subgradient and subdifferential make sense even if F is not convex or finite at y, but non-empty ∂F(y) implies F(y) = F∗∗(y) and ∂F(y) = ∂F∗∗(y) ([32], Theorem 12).2 The functional F is strictly convex if and only if the mapping ∂F is injective, so that the inverse mapping ∂F∗ is single-valued.
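The duality (11) can be observed directly in a smooth finite-dimensional case. The functional below is our assumed example (not the only possibility): F(y) = Σ (y ln y − y), an unnormalized relative entropy with respect to the counting measure, whose Legendre-Fenchel dual is F∗(x) = Σ e^x, so that ∇F(y) = ln y and ∇F∗(x) = e^x are mutually inverse maps:

```python
import math

# Gradients of the dual pair F(y) = sum(y ln y - y) and F*(x) = sum(e^x).
def grad_F(y):  return [math.log(v) for v in y]
def grad_Fs(x): return [math.exp(v) for v in x]

x = [0.5, -1.0, 2.0]
y = grad_Fs(x)         # y in dF*(x) ...
x_back = grad_F(y)     # ... if and only if x in dF(y), by (11)
assert all(abs(a - b) < 1e-12 for a, b in zip(x, x_back))

# Fenchel's equality <x, y> = F(y) + F*(x) holds exactly at such pairs:
F  = sum(v * math.log(v) - v for v in y)
Fs = sum(math.exp(v) for v in x)
pairing = sum(a * b for a, b in zip(x, y))
assert abs(pairing - (F + Fs)) < 1e-9
```

Because this F is strictly convex on the positive cone, ∂F is injective and the inverse map ∂F∗ is single-valued, which is exactly the smooth situation described above.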

Recall also that the subdifferential of a convex function is an example of a monotone operator [18]:

 ⟨x₁ − x₂, y₁ − y₂⟩ ≥ 0,  ∀ yᵢ ∈ ∂F∗(xᵢ)   (12)

The inequality is strict for all x₁ ≠ x₂ if and only if the mapping ∂F∗ is injective (i.e. ∂F∗ is strictly monotone).

We remind also that a functional is concave if its negative is convex, and the dual of a concave functional in the concave sense is defined with an infimum in place of the supremum. By analogy, one defines the supgradient and supdifferential of a concave function [32].
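The monotonicity property (12) can be checked numerically; the convex functional F∗(x) = Σ e^x, with elementwise subdifferential map ∂F∗(x) = e^x, and the two test points are illustrative assumptions:

```python
import math

# Monotonicity of the subdifferential map dF*(x) = e^x (elementwise).
def grad_Fs(x): return [math.exp(v) for v in x]

x1 = [0.3, -0.7, 1.2]
x2 = [-1.0, 0.4, 0.9]
y1, y2 = grad_Fs(x1), grad_Fs(x2)

# <x1 - x2, y1 - y2> >= 0, as required by (12).
inner = sum((a - b) * (c - d) for a, b, c, d in zip(x1, x2, y1, y2))
assert inner > 0   # strict here, since exp is strictly increasing and x1 != x2
```

Each coordinate contributes (x₁ − x₂)(e^{x₁} − e^{x₂}) ≥ 0 because the exponential is increasing, and strict monotonicity of exp makes the inequality strict for distinct arguments, matching the injectivity criterion above.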

## 3 General properties of optimal solutions and the optimal value functions

In this section, we apply the standard method of Lagrange multipliers to derive solutions achieving the optimal value x̄(λ). Then we shall study the existence of solutions and the monotonic properties of the optimal value functions (5)–(8).

### 3.1 Optimality conditions

###### Proposition 1 (Necessary and sufficient optimality conditions).

An element yβ maximizes the linear functional ⟨x, ·⟩ on the sublevel set {y : F(y) ≤ λ} of a closed functional F if and only if the following conditions hold

 yβ ∈ ∂F∗(βx),  F(yβ) = λ

where the parameter β is related to λ via the second condition F(yβ) = λ.

###### Proof.

If yβ maximizes ⟨x, ·⟩ on the sublevel set {y : F(y) ≤ λ}, then it belongs to the boundary of this set (because ⟨x, ·⟩ is linear and the set is closed). Moreover, yβ belongs also to the boundary of the convex closure of this set, because the convex closure is the intersection of all closed half-spaces containing it. Observe also that

 cl co{y : F(y) ≤ λ} = {y : F∗∗(y) ≤ λ}

and therefore solutions satisfy the conditions F∗∗(yβ) = λ and ∂F∗∗(yβ) ≠ ∅ (e.g. see [32], Theorem 12). Thus, the Lagrange function for the conditional extremum in (5) can be written in terms of F∗∗ as follows

 K(y, β⁻¹) = ⟨x, y⟩ + β⁻¹[λ − F∗∗(y)],

where β⁻¹ ≥ 0 is the Lagrange multiplier for the constraint F∗∗(y) ≤ λ. This Lagrange function is concave for β⁻¹ ≥ 0, and therefore the stationarity condition is both necessary and sufficient for yβ and β⁻¹ to define its least upper bound, which gives

 ∂_y K(yβ, β⁻¹) = x − β⁻¹ ∂F∗∗(yβ) ∋ 0  ⇒  yβ ∈ ∂F∗(βx)
 ∂_{β⁻¹} K(yβ, β⁻¹) = λ − F∗∗(yβ) = 0  ⇒  F∗∗(yβ) = λ

Note that if F ≠ F∗∗, then generally F(yβ) ≠ F∗∗(yβ), and the condition F∗∗(yβ) = λ must be replaced by the stronger condition F(yβ) = λ.

Noting that x̄(λ) = ⟨x, yβ⟩, the Lagrange multiplier is defined by β⁻¹ = dx̄(λ)/dλ. Note that β⁻¹ ≥ 0, because x̄ is non-decreasing, and β⁻¹ = 0 if and only if λ ≥ λ̄. ∎

###### Remark 1.

The inverse optimal value x̄⁻¹(υ), defined by equation (7), is achieved by solutions given by similar conditions. Indeed, the corresponding Lagrange function is

 K(y, β) = F∗∗(y) + β[υ − ⟨x, y⟩]

and the necessary and sufficient conditions are

 yβ ∈ ∂F∗(βx),  ⟨x, yβ⟩ = υ

where β is related to υ via the second condition ⟨x, yβ⟩ = υ. We note also that the conditions for the optimal values x̲(λ) and x̲⁻¹(υ), defined by equations (6) and (8), are identical to those in Proposition 1 and above, with the exceptions that β⁻¹ ≤ 0 and β ≤ 0 respectively.

### 3.2 Existence of solutions

The existence of optimal solutions in Proposition 1 is equivalent to the finiteness of x̄(λ), which depends on the properties of the sublevel set {y : F(y) ≤ λ} and the linear functional ⟨x, ·⟩. Clearly, the existence of solutions is guaranteed if the sublevel set is bounded in a Banach space and x is a bounded (continuous) linear functional. This setting, however, appears to be too restrictive. First, the restriction of x to the Banach dual space is not desirable in many applications. Indeed, measures are often considered as elements of a Banach space with the norm of absolute convergence, and therefore the dual space is complete with respect to the Chebyshev (supremum) norm. Many objective functions, however, such as utility or cost functions, are expressed using unbounded forms, such as polynomials, logarithms and exponentials. Second, the sublevel sets are generally unbalanced (i.e. membership of y does not imply membership of −y or its scalar multiples), which means that boundedness of ⟨x, ·⟩ above on such a set does not imply its boundedness below. In addition, the sets can be unbounded if we allow for measures that are not necessarily normalized. In this case, finiteness of x̄(λ) is no longer guaranteed, even if x is bounded. These considerations motivate us to define the most general class of linear functionals (elements of the algebraic dual) that admit optimal solutions to the generalized variational problems for measures and achieve finite optimal values for all constraints.

###### Definition 1 (F-bounded linear functional).

An element x is bounded above (below) relative to a closed functional F, or F-bounded above (below), if it is bounded above (below) on the sets {y : F(y) ≤ λ} for each λ < ∞. We call x F-bounded if it is F-bounded above and below.

Thus, bounded linear functionals are F-bounded. If F is understood as information, then we speak of information-bounded functionals. Although we do not address topological questions in this paper, we point out that the values x̄(λ) coincide with the values of the support function of the set {y : F(y) ≤ λ}, which generalizes a seminorm. In fact, a seminorm can be defined for F-bounded elements, which means they form a topological vector space. There are, however, elements that are only F-bounded above or below, as will be illustrated in the next example.

###### Example 1.

Let Ω = ℕ and let X, Y be spaces of real sequences with the pairing defined by the sum in (4). Let F(y) = ∑ₙ [y(n) ln y(n) − y(n)] for y ≥ 0, so that the gradient is ∇F(y) = ln y, and F is minimized at the counting measure y₀(n) ≡ 1. The optimal solutions have the form yβ(n) = e^{βx(n)}, and the values of the functions x̄ and x̲ are respectively

 ⟨x, yβ⟩ = ∑_{n=1}^∞ x(n) e^{βx(n)}  and  ⟨x, yβ⟩ = ∑_{n=1}^∞ x(n) e^{−βx(n)},  β⁻¹ > 0

In particular, for a suitably chosen unbounded x the first series converges for some β, but the second diverges for any β⁻¹ > 0. Thus, such x is F-bounded above, but not below. Observe also that it is unbounded, because its supremum norm is infinite. On the other hand, any constant sequence is bounded, but it is not F-bounded above or below.
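Example 1 can be illustrated numerically. The concrete choice x(n) = −n below is our assumption (any sufficiently fast-decreasing unbounded x behaves the same way): the first series converges to a finite value, while the terms of the second grow without bound:

```python
import math

# Illustration of Example 1 with the assumed choice x(n) = -n.
beta = 0.5
N = 2000

# First series: sum x(n) e^{beta x(n)} = sum -n e^{-beta n}, convergent.
upper = sum(-n * math.exp(beta * -n) for n in range(1, N + 1))
# Closed form of sum_{n>=1} -n r^n with r = e^{-beta}: -r / (1 - r)^2
r = math.exp(-beta)
assert abs(upper - (-r / (1 - r) ** 2)) < 1e-9

# Second series: sum x(n) e^{-beta x(n)} = sum -n e^{beta n}; its terms
# decrease without bound, so the series diverges to -infinity.
terms = [-n * math.exp(beta * n) for n in (10, 20, 30)]
assert terms[0] > terms[1] > terms[2]
```

This is the asymmetry behind F-boundedness above but not below: the exponential tilt e^{βx} suppresses the tail of an unbounded-below x, while the opposite tilt e^{−βx} amplifies it.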

The criterion for an element x to be F-bounded above follows from the optimality conditions obtained in Proposition 1.

###### Proposition 2 (Existence of solutions).

Solutions maximizing ⟨x, ·⟩ on the sets {y : F(y) ≤ λ} exist for all values λ of a closed functional F, if there exists at least one number β such that the subdifferential ∂F∗(βx) is non-empty.

###### Proof.

The element yβ ∈ ∂F∗(βx) maximizes ⟨x, ·⟩ on {y : F(y) ≤ λ} by Proposition 1, and if F(yβ) = λ, then x̄(λ) = ⟨x, yβ⟩. The optimal value is equal to

 ⟨x, yβ⟩ = β⁻¹[F∗(βx) + F(yβ)]

Note also that F∗(βx) is finite in this case. Because the sets {y : F(y) ≤ λ} are closed for all λ (F is closed), the existence of a solution for one λ implies the existence of solutions for all λ, and they are enumerated by different values of β. ∎

Thus, an element x is F-bounded above if ∂F∗(βx) is non-empty at least for one β > 0. Geometrically, this means that βx can be absorbed into the effective domain of F∗ for some β > 0. If x is also F-bounded below, then −βx can also be absorbed into it. Therefore, if x is F-bounded only above or below, then the origin of the one-dimensional subspace {βx : β ∈ ℝ} is not in the interior of dom F∗. In fact, it is well-known that if the sublevel sets of F are bounded, then the origin is in the interior of dom F∗ (see [5, 25]).
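The optimal-value identity of Proposition 2 can be verified for the assumed entropy-type example used throughout (F(y) = Σ (y ln y − y), with dual F∗(x) = Σ e^x and maximizer yβ = e^{βx}); the utility vector and β are arbitrary:

```python
import math

# Check <x, y_beta> = (1/beta) [F*(beta x) + F(y_beta)].
x = [0.8, -0.3, 1.5]
beta = 0.7

y_beta = [math.exp(beta * xi) for xi in x]             # maximizer in dF*(beta x)
F  = sum(v * math.log(v) - v for v in y_beta)          # F(y_beta)
Fs = sum(math.exp(beta * xi) for xi in x)              # F*(beta x)
lhs = sum(xi * yi for xi, yi in zip(x, y_beta))        # <x, y_beta>
rhs = (Fs + F) / beta

assert abs(lhs - rhs) < 1e-12
```

The identity is just Fenchel's equality ⟨βx, yβ⟩ = F∗(βx) + F(yβ) at the optimal pair, divided through by β.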

### 3.3 Monotonic properties

###### Proposition 3 (Monotonicity).

The optimal value functions x̄(λ), x̲(λ), x̄⁻¹(υ) and x̲⁻¹(υ), defined by equations (5), (6), (7) and (8) for a closed F and F-bounded x, have the following properties:

1. The mapping β⁻¹ ↦ ⟨x, yβ⟩ is non-increasing, and β ↦ ⟨x, yβ⟩ is non-decreasing.

2. If in addition F∗ is strictly convex, then these mappings are strictly monotone, and the optimal value functions are differentiable, so that dx̄(λ)/dλ = β⁻¹ and dx̄⁻¹(υ)/dυ = β.

3. x̄(λ) is concave and strictly increasing for λ < λ̄.

4. x̲(λ) is convex and strictly decreasing for λ < λ̲.

5. x̄⁻¹(υ) is convex and strictly increasing for υ > ῡ₀.

6. x̲⁻¹(υ) is convex and strictly decreasing for υ < υ̲₀.

where λ̄, λ̲ are defined by equations (9), and ῡ₀, υ̲₀ by equations (10).

###### Proof.
1. Let yβ₁, yβ₂ be maximizers of the linear functional ⟨x, ·⟩ on sublevel sets with constraints λ₁, λ₂ respectively, and let x̄(λ₁) and x̄(λ₂) denote the corresponding optimal values. Clearly, λ₁ ≤ λ₂ implies x̄(λ₁) ≤ x̄(λ₂) by the inclusion of the sublevel sets, so that the optimal value function is non-decreasing. Using the condition yβ ∈ ∂F∗(βx) of Proposition 1 and the monotonicity condition (12) for convex F∗, we have

 ⟨β₂x − β₁x, yβ₂ − yβ₁⟩ = (β₂ − β₁)⟨x, yβ₂ − yβ₁⟩ ≥ 0

Therefore, β₂ ≥ β₁ implies ⟨x, yβ₂⟩ ≥ ⟨x, yβ₁⟩. This proves that β⁻¹ ↦ ⟨x, yβ⟩ is non-increasing, and β ↦ ⟨x, yβ⟩ is non-decreasing.

2. The optimality condition yβ ∈ ∂F∗(βx) is equivalent to βx ∈ ∂F(yβ) by property (11), and together with the condition F(yβ) = λ or ⟨x, yβ⟩ = υ it implies that different β can correspond to the same λ or υ if and only if some ∂F∗(βx) includes both yβ₁ and yβ₂. This implies that F∗ is not strictly convex on the corresponding segment. Conversely, if F∗ is strictly convex, then β₁ ≠ β₂ implies yβ₁ ≠ yβ₂ and ⟨x, yβ₁⟩ ≠ ⟨x, yβ₂⟩, so that the mappings of item 1 are strictly monotone. In this case, the monotone functions x̄ and x̄⁻¹ are differentiable.

3. The function x̄(λ) is strictly increasing for λ < λ̄, because β⁻¹ > 0, and β⁻¹ = 0 if and only if λ ≥ λ̄ (Proposition 1). The mapping