
# On Game-Theoretic Risk Management (Part One)

Stefan Rass
stefan.rass@aau.at
Universität Klagenfurt, Institute of Applied Informatics, System Security Group, Universitätsstrasse 65-67, 9020 Klagenfurt, Austria. This work has been done in the course of consultancy for the EU Project HyRiM (Hybrid Risk Management for Utility Networks; see https://hyrim.net), led by the Austrian Institute of Technology (AIT; www.ait.ac.at). See the acknowledgement section.
###### Abstract

Optimal behavior in (competitive) situations is traditionally determined with the help of utility functions that measure the payoff of different actions. Given an ordering on the space of revenues (payoffs), the classical axiomatic approach of von Neumann and Morgenstern establishes the existence of suitable utility functions, and yields game theory as the most prominent materialization of a theory to determine optimal behavior. Although this appears to be a most natural approach to risk management too, applications in critical infrastructures often violate the implicit assumption of actions leading to deterministic consequences. In that sense, the gameplay in a critical infrastructure risk control competition is intrinsically random, in the sense of actions having uncertain consequences. Mathematically, this takes us to utility functions that are probability-distribution-valued, in which case we lose the canonical (in fact every possible) ordering on the space of payoffs, and the original techniques of von Neumann and Morgenstern no longer apply.

This work introduces a new kind of game in which uncertainty applies to the payoff functions rather than the players’ actions (the latter being a setting that has been widely studied in the literature, yielding celebrated notions like the trembling-hands equilibrium or the purification theorem). In detail, we show how to fix the non-existence of a (canonical) ordering on the space of probability distributions by only mildly restricting the full set to a subset that can be totally ordered. Our vehicle to define the ordering and establish basic game theory is non-standard analysis and hyperreal numbers.

*Towards a Theory of Games with Payoffs that are Probability-Distributions*

## 1 Introduction

Security risk management is a continuous cycle of action and reaction to the changing working conditions of an infrastructure. This cycle is detailed in relevant standards like ISO 2700x, where phases designated to planning, doing, checking and acting are rigorously defined and respective measures are given.

Our concern in this report is an investigation of a hidden assumption underneath this recommendation, namely the hypothesis that some wanted impact can be achieved by taking the proper action. If so, then security risk management would degenerate to a highly complex but nevertheless deterministic control problem, to which optimal solutions and strategies could be found (at least in theory).

Unfortunately, however, reality is intrinsically random to some extent, and the outcome of an action is almost never fully certain. Illustrative examples relate to how public opinion and trust depend on the public relations strategies of an institution. While there are surely ways to influence public opinion, it will always be ultimately out of one’s full and exclusive control. Regardless of this, we ought to find optimal ways to influence the situation in the way we like. This can – in theory – again be boiled down to a (not so simple) optimization problem, however, one that works on optimizing partially random outcomes. This is where things start to get nontrivial.

Difficulties in the defense against threats are rooted in the nature of the relevant attacks, since not all of them are immediately observable or induce instantly noticeable or measurable consequences. Indeed, the best we can do is to find an optimal protection against an a-priori identified set of attack scenarios, so as to gain assurance of security against the known list of threat scenarios. Optimizing this protection is often, but not necessarily, tied to some kind of adversary modeling, in an attempt to sharpen our expectations about what may happen to us. Such adversary modeling is inevitably error-prone, as the motives and incentives of an attacker may deviate from our imagination to an arbitrary extent.

Approaching the problem mathematically, there are two major lines of decision making: one works with an a-priori hypothesis of the current situation, and incorporates current information into an a-posteriori model that tells us how things will evolve, and specifically, which events are more likely than others, given the full information that we have. Decision making in that case means that we seek the optimal behavior so as to master a specifically expected setting (described by the a-posteriori distribution). This is the Bayesian approach to decision making (see [10] for a fully comprehensive treatment). The second way of decision making explicitly avoids any hypothesis about the current situation, and seeks an optimal behavior against any possible setting. Unlike the Bayesian perspective, we would thus intentionally and completely ignore all available data and choose our actions to master the worst-case scenario. While this so-called minimax decision making is obviously a more pessimistic and cautious approach, it appears better suited for risk management in situations where data is either not available, not trustworthy, or inaccurate.

For this reason, we will hereafter pursue the minimax approach and dedicate section 4.2 to a discussion of how this fits into the Bayesian framework as a special case.

We assume that the risk manager can repeatedly take actions and that the possible actions are finitely many. Furthermore, we assume that the adversary against which we do our risk control also has a finite number of possible ways to cause trouble. In terms of an ISO 2700x risk management process, the risk manager’s actions would instantiate controls, while the adversary’s actions would correspond to identified threat scenarios. The assumption of finiteness does not stringently constrain us here, as an infinite number of actions to choose from would in any case overstrain a human decision-maker.

The crucial point in all that follows is that any action (as taken by the decision maker) in any situation (action taken by the adversary) may have an intended but in any case random outcome. To properly formalize this and fit it into a mathematical, in fact game-theoretic, framework, we hereafter associate the risk manager with player 1 in our game, who competes with player 2, the adversary. Actions of either player are referred to in the game-theoretic literature as pure strategies; their entirety will be abbreviated as $PS_1$ and $PS_2$ for the two players, so $PS_1$ comprises all actions, hereafter called strategies, available for risk management, while $PS_2$ comprises all trouble scenarios. For our treatment, it is not required to be specific about what the elements in both action sets look like, as it is sufficient for them to “be available”.

Let $PS_1, PS_2$ denote the finite sets of strategies for the two players, where player 1 is the honest defender (e.g., utility infrastructure provider), and player 2 is the adversary. We assume player 1 to be unaware of its opponent’s incentives, so that an optimal strategy is sought against any possible behavior within the known action space of the opponent (rational or irrational, e.g., nature).

In this sense, $PS_2$ can be the set of all known possible security incidents, whose particular incarnations can become reality by the adversary’s action. To guard its assets, player 1 can choose from a finite set $PS_1$ of actions to minimize the costs of a recovery from any incident, or equivalently, keep its risk under control.

Upon these assumptions, the situation can be described by an $(n\times m)$-matrix of scenarios, where $n=|PS_1|$ and $m=|PS_2|$, each of which is associated with some cost to recover the system from a malfunctioning state back to normal operation. We use the variable $R_{ij}$ henceforth to denote the cost of a repair made necessary by an incident $j$ happening when the system is currently in configuration $i$.

The process of risk management will be associated with player 1 putting the system into different configurations over time in order to minimize the risk $R$.

###### Remark 1.1

We leave the exact understanding of “risk” or “damage” intentionally undefined here, as this will be quite different between various utility infrastructures or general fields of application.

###### Remark 1.2

Neither the set $PS_1$ nor the set $PS_2$ is specified here in any detail further than declaring it an “action space”. The reason is, again, the expected diversity of actions and incidents among various fields of application (or utility infrastructures). Therefore, to keep this report general and not limit the applicability of the results to follow, we leave the precise elements of $PS_1$ and $PS_2$ up to definitions that are tailored to the intended application.

Examples of strategies may include:

• random spot checks in the system to locate and fix problems (ultimately, to keep the system running),

• random surveillance checks at certain locations,

• certain efforts or decisions about whether or not, and which, risks or countermeasures shall be communicated to the society or user community,

• etc.

In real-life settings, it can be expected that an action (regardless of who takes it) always has some intrinsic randomness. That is, the effect of a particular scenario $(i,j)$ is actually a random variable $R_{ij}$, having only some “expected” outcome that may be different between any two occurrences of the same situation over time.

To be able to properly handle the arising random variables, let us think of those modeling not the benefits but rather the damage that a security incident may cause. In this view, we can go for minimization of an expectedly positive value that measures the cost of a recovery. Formally, we introduce the following assumption that will greatly ease theoretical technicalities throughout this work, while not limiting the practicability too much.

All members of the family of random damage distributions in our game will be assumed to satisfy the following assumption:

###### Assumption 1.3

Let $R$ be a real-valued random variable. On $R$, we impose the following assumptions:

• $R\geq 1$ (w.l.o.g.; it is common to assume losses to be $\geq 0$; our shift has technical reasons, but causes no semantic difference in the comparisons between two loss densities, since both loss variables are just shifted by the same amount. Also, the loss can (w.l.o.g.) be scaled until losses in the range $[0,1]$ become practically negligible.)

• $R$ has a known distribution with compact support (note that this implies that $R$ is bounded).

• The probability measure induced by $R$ is either discrete or continuous and has a density function $f$. For continuous random variables, the density function is assumed to be continuous.

### 1.1 Symbols and Notation

This section is mostly intended to refresh the reader’s memory about some basic but necessary concepts from calculus and probability theory that we will use in the following to develop the theoretical groundwork. This subsection can thus be safely skipped and may be consulted whenever necessary to clarify details.

#### General Symbols:

Sets, random variables and probability distribution functions are denoted by upper-case letters like $X$ or $F$. Matrices and vectors are denoted by bold-face upper- and lower-case letters, respectively. For a finite set $X$, we write $|X|$ for the number of elements (cardinality). For a real value $x$, $|x|$ denotes the absolute value of $x$. For an arbitrary set $A$, the symbol $A^{\mathbb{N}}$ is the $\mathbb{N}$-fold cartesian product of $A$; the set $A^{\mathbb{N}}$ thus represents the collection of all infinite sequences with elements from $A$. We denote such a sequence as $(a_n)_{n\in\mathbb{N}}$.

If $X$ is a random variable, then its probability distribution is told by the notation $X\sim F_X$. Whenever this is clear from the context, we omit the subscript to $F$ and write only $F$. If $X$ lives on a discrete set, then we call $X$ a discrete random variable. Otherwise, if $X$ takes values in some infinite and uncountable set, such as $\mathbb{R}$, then we call $X$ a continuous random variable. For discrete distributions, we may also use the vector $(p_1,\ldots,p_n)$ of probabilities of each event to denote the distribution of the discrete variable as $X\sim(p_1,\ldots,p_n)$.

Calligraphic letters denote families (sets) of sets or probability distributions; e.g., ultrafilters (defined below) are denoted as $\mathcal{U}$, and the family of all probability distributions is denoted as $\mathcal{F}$. The family of subsets of a set $A$ is written as $\mathcal{P}(A)$ (the power-set of $A$). If $F$ is a probability distribution, then its density – provided it exists – is denoted by the respective lower-case letter $f$.

#### Topology and Norms:

As our considerations in section 3 will heavily rely on concepts of continuity and compactness or openness of sets, we briefly review the necessary concepts now.

A set $U$ is called open, if for every $x\in U$ there is another open set $V\subseteq U$ that contains $x$. The family of all open sets is characterized by the property of being closed under infinite union and finite intersection. Such a family $\mathcal{T}$ is called a topology, and a set $X$ together with a topology is called a topological space. An interval is called closed, if its complement (w.r.t. the space $\mathbb{R}$) is open.

In $\mathbb{R}$, it can be shown that the open intervals are all of the form $\{x : a < x < b\}$ for $a,b\in\mathbb{R}$ and $a<b$. We denote these intervals by $(a,b)$, and the topology on $\mathbb{R}$ is the set containing all of them. Note that the existence of a total ordering on a space always induces the so-called order topology, whose open sets are defined exactly in the aforementioned way. Closed intervals are denoted by square brackets, $[a,b]$. A set $A\subseteq\mathbb{R}$ is called bounded, if there are two constants $a\leq b$ so that all $x\in A$ satisfy $a\leq x\leq b$. A subset of $\mathbb{R}$ is called compact, if and only if it is closed and bounded.

For $(X,d_X)$ and $(Y,d_Y)$ being two metric spaces, we call a function $f:X\to Y$ continuous, if for every $x\in X$ and every $\varepsilon>0$ there is some $\delta>0$ for which $d_X(x,x')<\delta$ implies $d_Y(f(x),f(x'))<\varepsilon$. If the condition holds with the same $\delta$ for every $x$, then we call $f$ uniformly continuous on the set $X$. It can be shown that if a function is continuous on a compact set $A$, then it is also uniformly continuous on $A$ (in general, however, continuity does not imply uniform continuity). In the following, we will need this result only for functions mapping compact subsets of $\mathbb{R}^n$ to probability distributions (the space that we consider there will be the set of hyperreal numbers, which has a topology but – unfortunately – neither a metric nor a norm).

On a space $X$, we write $\lVert x\rVert$ to denote the norm of a vector $x\in X$. One example is the $p$-norm on $\mathbb{R}^n$, which is $\lVert x\rVert_p=\left(\sum_{i=1}^n |x_i|^p\right)^{1/p}$ for every $x\in\mathbb{R}^n$. This induces the metric $d(x,y)=\lVert x-y\rVert$.

It can be shown that every metric space is also a topological space, but the converse is not true in general. However, the above definition of continuity is (on metric spaces) equivalent to saying that a function $f:X\to Y$ is continuous, if and only if every open set $U\in\mathcal{T}_Y$ has an open preimage $f^{-1}(U)\in\mathcal{T}_X$, where $\mathcal{T}_X, \mathcal{T}_Y$ denote the topologies on $X$ and $Y$, respectively. This characterization works without metrics and will be used later to prove continuity of payoff functions (see lemma 3.1 and proposition 3.2).

#### Probabilities and Moments:

Let $A$ be a subset of some measurable space (we will not require any further details on measurability or $\sigma$-algebras in this report, so we spare the details and an intuitive explanation of the necessary concepts here), and let $F$ be a probability distribution function. The probability measure is the Lebesgue–Stieltjes integral $\Pr_F(A)=\int_A \mathrm{d}F$ (note that this general formulation covers both discrete and continuous random variables on the same formal ground). Whenever the distribution is obvious from the context, we will omit the subscript to the probability measure, and simply write $\Pr(A)$ as a shorthand of $\Pr_F(A)$.

All probability distribution functions that we consider in this report will have a density function associated with them. If so, then we call the closure of the set $\{x : f(x)>0\}$ the support of $F$, denoted as $\mathrm{supp}(F)$. A degenerate distribution on $\mathbb{R}$ is one that assigns probability mass 1 to a finite number (or more generally, a null-set) of points in $\mathbb{R}$. If $\Pr(\{x\})=1$ for a singleton set $\{x\}$, then we call this degenerate distribution a point-mass or a Dirac-mass. We stress that such distributions do not have a density function associated with them in general (at least not within the space of ordinary functions; the Dirac-mass is, however, an irregular generalized function – a concept that we will not need here).

Many commonly used distributions have infinite support, such as the Gaussian distribution. The density function can, however, be cut off outside a bounded range and re-scaled to normalize to a probability distribution again. This technique lets us approximate any probability distribution by one with compact support (a technique that will come in handy in section 2.5).
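As an illustration, the cut-off-and-rescale step can be sketched numerically. This is a minimal sketch in plain Python with trapezoidal quadrature; the function name and all parameter values are our own choices, not from the text:

```python
import math

def truncated_gaussian_pdf(mu, sigma, a, b, n=10_000):
    """Cut a Gaussian density off outside [a, b] and re-scale it so that
    it integrates to 1 again, yielding a density with compact support."""
    phi = lambda x: math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    # probability mass inside [a, b], approximated by the trapezoidal rule
    h = (b - a) / n
    mass = sum(0.5 * (phi(a + i * h) + phi(a + (i + 1) * h)) * h for i in range(n))
    return lambda x: phi(x) / mass if a <= x <= b else 0.0

f = truncated_gaussian_pdf(mu=5.0, sigma=2.0, a=1.0, b=10.0)

# the re-scaled density integrates to (approximately) 1 over [1, 10]
h = 9.0 / 10_000
total = sum(0.5 * (f(1.0 + i * h) + f(1.0 + (i + 1) * h)) * h for i in range(10_000))
```

Outside $[a,b]$ the density is exactly zero, so the result is a valid compactly supported approximation of the original Gaussian.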

The expectation of a random variable $X$ is (by the law of large numbers) the long-run average of realizations, or more rigorously, defined as $E(X)=\int x\,\mathrm{d}F(x)$. The $k$-th moment of a distribution is the expectation of $X^k$, which is denoted $m_X(k)$ and defined as $E(X^k)=\int x^k\,\mathrm{d}F(x)$, or also $\int x^k f(x)\,\mathrm{d}x$, if $X$ has a density function $f$. Special roles are played by the first four moments, or values derived thereof. One prominent example is the variance $\mathrm{Var}(X)=E(X^2)-E(X)^2$ (this formula is known as Steiner’s theorem). Of particular importance is the so-called moment-generating function $\mu_X(s)=E(e^{sX})$, from which the $k$-th moment can be computed by taking the $k$-th order derivative evaluated at the origin, i.e., we have $E(X^k)=\mu_X^{(k)}(0)$. Moments do not necessarily exist for all distributions (an example is the Cauchy distribution, which has no finite moments), but exist for all distributions with compact support (which can be used to approximate every other distribution up to arbitrary precision).
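A quick sanity check of these definitions, using exact rational arithmetic; the small loss distribution below is a hypothetical example of ours, not from the text:

```python
from fractions import Fraction

# a discrete loss variable with compact support {1, 2, 4} (cf. assumption 1.3)
values = [1, 2, 4]
probs  = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

def moment(k):
    """k-th moment E(X^k) = sum over x of x^k * Pr(X = x)."""
    return sum(p * x ** k for x, p in zip(values, probs))

mean     = moment(1)              # E(X) = 1/2 + 1/2 + 1 = 2
variance = moment(2) - mean ** 2  # Steiner's theorem: Var = E(X^2) - E(X)^2
```

Exact fractions make the identity $\mathrm{Var}(X)=E(X^2)-E(X)^2$ verifiable without floating-point noise.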

Multivariate distributions model vector-valued random variables $(X_1,\ldots,X_n)$. Their distribution is denoted as $F(x_1,\ldots,x_n)$, or shorthanded as $F(\mathbf{x})$. For an $n$-dimensional distribution, the respective density function is then of the form $f(x_1,\ldots,x_n)$, having the integral $\int_{\mathbb{R}^n}f(\mathbf{x})\,\mathrm{d}\mathbf{x}=1$. This joint distribution in particular models the interplay between the (perhaps mutually dependent) random variables $X_1,\ldots,X_n$. The marginal distribution of any of the variables $X_i$ (where $1\leq i\leq n$) is the unconditional distribution of $X_i$, no matter what the other variables do. Its density function is obtained by “integrating out” the other variables, i.e.,

$$f_{X_i}(x_i)=\int_{\mathbb{R}^{n-1}} f(x_1,\ldots,x_n)\,\mathrm{d}(x_1,\ldots,x_{i-1},x_{i+1},\ldots,x_n).$$

A (marginal) distribution is called uniform, if its support is bounded and its density is a constant. The joint probability of a multivariate event, i.e., a multidimensional set $A$ w.r.t. a multivariate distribution $F$, is denoted as $\Pr_F(A)$. That is, the distribution w.r.t. which the probabilities are taken is given in the subscript, whenever this is useful or necessary to make things clear.

A particularly important class of distributions are copulas. These are multivariate probability distribution functions on the $n$-dimensional hypercube $[0,1]^n$, for which all marginal distributions are uniform. The importance of copula functions is due to Sklar’s theorem, which tells that the joint distribution $F$ of the random vector $(X_1,\ldots,X_n)$ can be expressed in terms of the marginal distribution functions $F_1,\ldots,F_n$ and a copula function $C$ as $F(x_1,\ldots,x_n)=C(F_1(x_1),\ldots,F_n(x_n))$. So, for example, independence of events can be modeled by the simple product copula $C(u_1,\ldots,u_n)=u_1\cdot u_2\cdots u_n$. Many other classes of copula functions and a comprehensive discussion of the topic as such can be found in [8].
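To make Sklar’s theorem concrete, a minimal sketch: the marginals below (hypothetical exponential CDFs of our own choosing) are coupled by the product copula, and any copula respects the Fréchet–Hoeffding upper bound discussed in section 2.1:

```python
import math

# hypothetical marginal CDFs of two nonnegative random variables
F_X = lambda x: 1.0 - math.exp(-x) if x > 0 else 0.0
F_Y = lambda y: 1.0 - math.exp(-2.0 * y) if y > 0 else 0.0

# product copula C(u, v) = u * v models independence
C_prod = lambda u, v: u * v

# Sklar's theorem: the joint CDF is the copula applied to the marginals
F_joint = lambda x, y: C_prod(F_X(x), F_Y(y))

# Fréchet-Hoeffding upper bound: C(u, v) <= min(u, v) on the unit square
grid = [i / 10 for i in range(11)]
bound_ok = all(C_prod(u, v) <= min(u, v) for u in grid for v in grid)
```

Under the product copula the joint CDF simply factorizes into the product of the marginals, which is exactly the stochastic-independence case mentioned above.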

#### Convexity and Concavity:

Let $V$ be a vector space. We call a set $M\subseteq V$ convex, if for any two points $x,y\in M$, the entire line connecting $x$ to $y$ is also contained in $M$. Let $f:M\to\mathbb{R}$ be a function and take two values $x,y\in M$. The function is called convex, if the line between $f(x)$ and $f(y)$ upper-bounds $f$ between $x$ and $y$. More formally, let $g(t)=(1-t)f(x)+t\,f(y)$ be the straight line from $f(x)$ to $f(y)$; then $f$ is convex if $f((1-t)x+t\,y)\leq g(t)$ for all $t\in[0,1]$. A function $f$ is called concave if $-f$ is convex.

#### Hyperreal Numbers and Ultrafilters:

Take the set $\mathbb{R}^{\mathbb{N}}$ of infinite sequences over the real numbers $\mathbb{R}$. On this set, we can define the arithmetic operations $+$ and $\cdot$ elementwise on two sequences $\mathbf{x}=(x_n)_{n\in\mathbb{N}}$ and $\mathbf{y}=(y_n)_{n\in\mathbb{N}}$ by setting $\mathbf{x}+\mathbf{y}=(x_n+y_n)_{n\in\mathbb{N}}$ and $\mathbf{x}\cdot\mathbf{y}=(x_n\cdot y_n)_{n\in\mathbb{N}}$. The ordering of the reals, however, cannot be carried over in this way, as two sequences $\mathbf{x}$ and $\mathbf{y}$ could satisfy $x_n<y_n$ on some components and $x_n>y_n$ on some others. To fix this, we need to be specific on which indices matter for the comparison, and which do not. The resulting family of index-sets can be characterized as a so-called free ultrafilter, which is defined as follows: a family $\mathcal{F}\subseteq\mathcal{P}(\mathbb{N})$ is called a filter, if the following three properties are satisfied:

• proper and non-empty: $\emptyset\notin\mathcal{F}$ and $\mathbb{N}\in\mathcal{F}$

• closed under supersets: $A\in\mathcal{F}$ and $A\subseteq B$ implies $B\in\mathcal{F}$

• closed under intersection: $A\in\mathcal{F}$ and $B\in\mathcal{F}$ implies $A\cap B\in\mathcal{F}$

If, in addition, $A\notin\mathcal{F}$ implies that $\mathcal{F}$ contains the complement set $\mathbb{N}\setminus A$, then $\mathcal{F}$ is called an ultrafilter; equivalently, an ultrafilter is a filter $\mathcal{U}$ that is maximal w.r.t. the $\subseteq$-relation, i.e., any filter that contains $\mathcal{U}$ is equal to $\mathcal{U}$. A simple example of a filter is the Fréchet filter, which is the family $\{A\subseteq\mathbb{N} : \mathbb{N}\setminus A \text{ is finite}\}$. A filter is called free, if it contains no finite sets. An application of Zorn’s lemma to the semi-ordering induced by $\subseteq$ shows the existence of free ultrafilters as being $\subseteq$-maximal elements extending the Fréchet filter.

An ultrafilter $\mathcal{U}$ naturally induces an equivalence relation on $\mathbb{R}^{\mathbb{N}}$ by virtue of calling two sequences $\mathbf{x},\mathbf{y}$ $\mathcal{U}$-equivalent, if and only if $\{n\in\mathbb{N} : x_n=y_n\}\in\mathcal{U}$, i.e., the set of indices on which $\mathbf{x}$ and $\mathbf{y}$ coincide belongs to $\mathcal{U}$. The $\leq$- and $\geq$-relations can be defined in exactly the same fashion. The family of equivalence classes modulo $\mathcal{U}$ makes up the set of hyperreal numbers, i.e., ${}^*\mathbb{R}=\mathbb{R}^{\mathbb{N}}/\mathcal{U}$. In lack of an exact model of ${}^*\mathbb{R}$, due to the non-constructive existence assurance of the necessary free ultrafilter, we are unfortunately unable to practically do arbitrary arithmetic in ${}^*\mathbb{R}$. It will be shown (later and in part two of this report) that everything that needs to be computed practically works without $\mathcal{U}$ being explicitly known.

#### Elements of Game Theory:

Let $N=\{1,\ldots,n\}$ be a finite set of players. Let $S_i$ be a finite set of actions for player $i$, and denote by $S_{-i}$ the cartesian product $S_1\times\cdots\times S_{i-1}\times S_{i+1}\times\cdots\times S_n$, i.e., the product of all action sets excluding $S_i$.

A finite non-cooperative $n$-person game is a triple $(N,\{S_i\}_{i\in N},\{u_i\}_{i\in N})$, where the set $\{u_i\}_{i\in N}$ contains all players’ payoff functions, and the family $\{S_i\}_{i\in N}$ comprises the strategy sets of all players. The attribute finite is given to the game if and only if all $S_i$ are finite. An equilibrium strategy is an element $(x_i^*)_{i\in N}$, so that all $i\in N$ and all $x_i\in S_i$ have

$$u_i(x_i^*,x_{-i}^*)\geq u_i(x_i,x_{-i}^*). \qquad (1)$$

That is, action $x_i^*$ gives the maximal outcome for the $i$-th player, provided that all other players follow their individual equilibrium strategies. Otherwise said, no player has an incentive to solely deviate from $x^*$, as this would only worsen the revenue from the gameplay (it should be mentioned that this does not necessarily rule out benefits for coalitions of players upon jointly deviating from the equilibrium strategy; this, however, is the subject of cooperative game theory, which we do not discuss here any further). It is easy to construct examples in which no such equilibrium strategy exists. To fix this, one usually considers repetitions of the gameplay, and defines the revenue for a player as the long-run average of all payoffs in each round. Technically, this assures the existence of equilibrium strategies in all finite games (Nash’s theorem). We will implicitly rely on this possibility here too, while explicitly looking at the outcome of the game in a single round. As this is – by our fundamental hypotheses in this report – a random variable itself, condition (1) can no longer be soundly defined, as random variables are not canonically ordered. The core of this work will therefore be on finding a substitute for the $\geq$-relation, so as to properly restate (1) when random variables appear on both sides of the inequality.
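For intuition on equilibria arising from repeated play, the following sketch runs fictitious play on a classical zero-sum matrix game (matching pennies; the payoff matrix and iteration count are our own example, not from the text). In zero-sum games, the empirical action frequencies of fictitious play are known to converge to an equilibrium:

```python
# Fictitious play: both players repeatedly best-respond to the opponent's
# empirical mixture of past actions; the long-run frequencies approximate
# an equilibrium (a didactic sketch, not the method developed in this report).
A = [[1.0, -1.0],
     [-1.0, 1.0]]   # matching pennies: value 0, equilibrium (1/2, 1/2) for both

n, m = len(A), len(A[0])
row_counts, col_counts = [0] * n, [0] * m
row_choice = col_choice = 0
for _ in range(20_000):
    row_counts[row_choice] += 1
    col_counts[col_choice] += 1
    # row player maximizes its payoff against the column player's empirical mix
    row_choice = max(range(n), key=lambda i: sum(A[i][j] * col_counts[j] for j in range(m)))
    # column player minimizes the row player's payoff against the row mix
    col_choice = min(range(m), key=lambda j: sum(A[i][j] * row_counts[i] for i in range(n)))

p = [c / sum(row_counts) for c in row_counts]  # empirical mixed strategy, player 1
q = [c / sum(col_counts) for c in col_counts]  # empirical mixed strategy, player 2
```

After 20,000 rounds, both empirical mixtures are close to the uniform equilibrium strategy $(1/2, 1/2)$.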

## 2 Optimal Decisions under Uncertainty

Under the above setting, we can collect all scenarios of actions that player 1 (defender) and player 2 (attacker) may take in a tabular (matrix) fashion. Our goal in this first step is to soundly define what “a best action” would be in light of uncertain, indeed random, effects that actions on either side cause, especially lacking control over the other’s actions. For that matter, we will consider the scenario matrix, as given below, as the payoff structure of a matrix game, whose mathematical underpinning is the standard setup of game theory (see [3] for example), with differences and necessary changes to the classical theory of games being discussed in sections 3 and later.

Let the following tableau be a collection of all scenarios of actions taken by the defender (row-player) and attacker (column-player),

$$A=\begin{pmatrix}
R_{11} & \cdots & R_{1j} & \cdots & R_{1m}\\
\vdots & \ddots & \vdots & \ddots & \vdots\\
R_{i1} & \cdots & R_{ij} & \cdots & R_{im}\\
\vdots & \ddots & \vdots & \ddots & \vdots\\
R_{n1} & \cdots & R_{nj} & \cdots & R_{nm}
\end{pmatrix},$$

where the rows of the matrix are labeled by the actions in $PS_1$, and the columns of $A$ carry the labels of actions from $PS_2$.

A security strategy for player 1 is an optimal choice of a row $i$ so that the risk, expressed by the random variable $R_{ij}$, is “optimized” over all possible actions $j$ of the opponent. Here, we run into trouble already, as there is no canonical ordering on the set of probability distributions.

To resolve this issue, let us consider repetitions of the gameplay in which each player can choose his actions repeatedly and differently, in an attempt to minimize risk (or damage). This corresponds to situations in which “the best” configuration simply does not exist, and we are forced to repeatedly change or reconsider the configuration of the system in order to remain protected.

In a classical game-theoretic approach, this takes us to the concept of mixed strategies, which are discrete probability distributions over the action spaces of the players. Making this rigorous, let $S(X)$ for a finite set $X$ denote the simplex over $X$, i.e., the space of all discrete probability distributions supported on $X$. More formally, given the support $X$, the set $S(X)$ is

$$S(X):=\left\{(p_1,\ldots,p_k)\in\mathbb{R}^k : k=|X|,\ \sum_{i=1}^k p_i=1,\ p_i\geq 0\ \forall i\right\}.$$

A randomized decision is thus a rule to choose from the available actions in the action set $X$ (which is $PS_1$ or $PS_2$ hereafter) with corresponding probabilities $p_1,\ldots,p_k$. We assume the ordering of the actions to be arbitrary but fixed (for obvious reasons).

Now, we return to the problem of what effect to expect when the current configuration of the system is randomly drawn from $PS_1$, and the adversary’s action is another random choice from $PS_2$. For that matter, let us simplify notation by putting $n=|PS_1|$ and $m=|PS_2|$, and let the two mixed strategies be $\mathbf{p}\in S(PS_1)$ for player 1, and $\mathbf{q}\in S(PS_2)$ for player 2.

Since the choice from the matrix $A$ is random, where the row $i$ is drawn with likelihoods as specified by $\mathbf{p}$, and the column $j$ is drawn according to $\mathbf{q}$, the law of total probability yields for the outcome $R$,

$$\Pr(R\leq r)=\sum_{i,j}\Pr(R_{ij}\leq r\mid i,j)\,\Pr(i,j), \qquad (2)$$

where $\Pr(R_{ij}\leq r\mid i,j)$ is the conditional probability of $R\leq r$ given a particular choice $(i,j)$, and $\Pr(i,j)$ is the (unconditional) probability for this choice to occur. Section 2.1 gives some more details on how $\Pr(i,j)$ can be modeled and expressed.

Denote by $F(\mathbf{p},\mathbf{q})$ the distribution of the game’s outcome under the strategies $\mathbf{p},\mathbf{q}$; then $F$ depends on $\mathbf{p}$ and $\mathbf{q}$, and (2) can be rewritten as

$$\Pr(R\leq r)=(F(\mathbf{p},\mathbf{q}))(r)=\sum_{i,j}F_{ij}(r)\,C_{\mathbf{p},\mathbf{q}}(i,j), \qquad (3)$$

where $C_{\mathbf{p},\mathbf{q}}(i,j)=\Pr(i,j)$ will be assumed to be continuous in $\mathbf{p}$ and $\mathbf{q}$ for technical reasons that will become evident later (during the proof of proposition 3.2). Note that the distribution, via the function $F$, explicitly depends on the choices $\mathbf{p},\mathbf{q}$, and is to be “optimally shaped” w.r.t. these two variables. The argument $r$ to the function is the (random) “revenue”, whose uncertainty is outside either player’s influence (besides shaping by proper choices of $\mathbf{p}$ and $\mathbf{q}$).
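For independently chosen actions, (3) is a plain mixture of the component distributions, weighted by the choice probabilities. A sketch with hypothetical uniform loss components (the component supports and strategies below are our own example values):

```python
# component CDFs F_ij of the losses R_ij: hypothetical uniform losses on
# [1, b_ij], consistent with compact support and R >= 1 (assumption 1.3)
def uniform_cdf(b):
    return lambda r: 0.0 if r < 1.0 else (1.0 if r > b else (r - 1.0) / (b - 1.0))

F = [[uniform_cdf(3.0), uniform_cdf(5.0)],
     [uniform_cdf(2.0), uniform_cdf(8.0)]]

def outcome_cdf(p, q):
    """F(p, q) from (3), with the independence choice C(i, j) = p_i * q_j."""
    return lambda r: sum(p[i] * q[j] * F[i][j](r)
                         for i in range(len(p)) for j in range(len(q)))

G = outcome_cdf([0.5, 0.5], [0.25, 0.75])
```

The resulting function `G` is again a valid CDF: it vanishes below the smallest loss, is monotone, and reaches 1 beyond the largest component support.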

The “revenue” in the game can be of manifold nature, such as

• Risk response of society; a quantitative measure that could rate people’s opinions and confidence in the utility infrastructure

• Repair cost to recover from an incident’s implied damage,

• Reliability, if the game is about whether or not a particular quality of service can be kept up,

• etc.

###### Remark 2.1

In the simplest case of independent actions, we would set $C_{\mathbf{p},\mathbf{q}}(i,j)=p_i\cdot q_j$ for $\mathbf{p}=(p_1,\ldots,p_n)$ and $\mathbf{q}=(q_1,\ldots,q_m)$. This choice, along with assuming the $R_{ij}$ to be constants rather than random variables, recreates the familiar matrix-game payoff functional from (3). Hence, (3) is a first generalization of matrix games to games with uncertain outcome, which for the sake of flexibility and generality, is “distribution-valued”.

###### Remark 2.2

It may be in reality the case that actions of the two players are not chosen independently, for example, if both of the players possess some common knowledge or access to a common source of information. In game-theoretic terms, this would lead to so-called correlated equilibria (see [3]), in which the players share two correlated random variables that influence their choices. Things here are nevertheless different, as no bidirectional flow of information can be assumed like for correlated equilibria (the attacker won’t inform the utility infrastructure provider about anything in advance, while information from the provider may somehow leak out to the adversary).

### 2.1 Choosing Actions (In)dependently

The concrete choice of the function $C_{\mathbf{p},\mathbf{q}}$ is only subject to continuity in $\mathbf{p}$ and $\mathbf{q}$ for technical reasons that will receive a closer look now. The general joint probability of the scenario $(i,j)$ w.r.t. the marginal discrete distribution vectors $\mathbf{p},\mathbf{q}$ is $\Pr(i,j)$ in (2). Under independence of the random choices, it can be written as $\Pr(i,j)=p_i\cdot q_j$.

Now, let us consider cases where the choices are not independent, say, if one player observes the other player’s actions and can react on them (or if both players have access to common source of information).

Sklar’s theorem implies the existence of a copula function $C$ so that the joint distribution of the choices $(X,Y)$ can be written in terms of the copula and the marginal distributions $F_X$, corresponding to the vector $\mathbf{p}$, and $F_Y$, corresponding to the vector $\mathbf{q}$,

$$F_{(X,Y)}(i,j)=\Pr(X\leq i,\,Y\leq j)=C(F_X(i),F_Y(j)).$$

$$\begin{aligned}
\Pr(i,j) &= \Pr(X=i,\,Y=j)\\
&= \Pr(X\leq i,\,Y\leq j)-\Pr(X\leq i-1,\,Y\leq j)\\
&\quad -\Pr(X\leq i,\,Y\leq j-1)+\Pr(X\leq i-1,\,Y\leq j-1)\\
&= C(F_X(i),F_Y(j))-C(F_X(i-1),F_Y(j))\\
&\quad -C(F_X(i),F_Y(j-1))+C(F_X(i-1),F_Y(j-1)). \qquad (4)
\end{aligned}$$

Thus, the function $C_{\mathbf{p},\mathbf{q}}$ can be constructed from (4) based on the copula $C$ (which must exist). Continuity of $C_{\mathbf{p},\mathbf{q}}$ thus hinges on the continuity of the copula function. At least two situations admit a choice of $C$ that makes $C_{\mathbf{p},\mathbf{q}}$ continuous:

• Independence of actions: $C(u,v)=u\cdot v$,

• Complete lack of knowledge about the interplay between the action choices, in which case we can set $C(u,v)=\min\{u,v\}$.

This choice is justified by the well-known Fréchet–Hoeffding bound, which says that every $n$-dimensional copula function $C$ satisfies

$$C(u_1,u_2,\ldots,u_n)\leq\min\{u_1,\ldots,u_n\}.$$

Since the $\min$-function is itself a copula, it can be chosen if a dependency is known to exist, but with no details on the particular nature of the interplay. Observe that this corresponds to the well-known maximum principle of system security, where the overall system risk is determined by the maximum risk among its components (alternatively, you may think of a chain being as strong as its weakest element, which corresponds to the $\min$-function among all strength indicators).
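The construction (4) can be carried out directly. The sketch below tabulates $\Pr(i,j)$ for the $\min$-copula with two hypothetical mixed strategies of our own choosing, and checks that the resulting joint probabilities are nonnegative and sum to one:

```python
# Joint choice probabilities from a copula via the inclusion-exclusion
# formula (4); here with the upper Fréchet-Hoeffding bound C = min.
p = [0.2, 0.5, 0.3]   # mixed strategy of player 1
q = [0.4, 0.4, 0.2]   # mixed strategy of player 2

def cdf(v, i):
    """Marginal CDF at index i: F(i) = v_1 + ... + v_i, with F(0) = 0."""
    return sum(v[:i])

C = min               # the comonotone (min-)copula, taking two arguments

def joint(i, j):
    """Pr(i, j) by formula (4); indices are counted from 1."""
    return (C(cdf(p, i), cdf(q, j)) - C(cdf(p, i - 1), cdf(q, j))
            - C(cdf(p, i), cdf(q, j - 1)) + C(cdf(p, i - 1), cdf(q, j - 1)))

table = [[joint(i, j) for j in range(1, 4)] for i in range(1, 4)]
total = sum(sum(row) for row in table)
```

Because the $\min$-copula is 2-increasing, every rectangle mass in (4) is nonnegative, and the telescoping sum of all entries equals $C(1,1)=1$.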

### 2.2 Comparing Payoff Distributions

There appears to be no canonical way to compare payoff distributions, as a distribution can be determined by an arbitrary number of parameters, thus introducing ambiguity in how to compare them. To see this, simply consider the set of normal distributions, each being determined by two parameters $\mu$ and $\sigma$. Since the pair $(\mu,\sigma)$ uniquely determines the distribution function, a comparison between two members amounts to a criterion to compare the two-dimensional vectors $(\mu_1,\sigma_1)$ and $(\mu_2,\sigma_2)$. It is well-known that $\mathbb{R}^2$ is not ordered (as being isomorphic to $\mathbb{C}$, on which provably no order exists; see [1] for a proof), and hence there is no natural ordering on the set of probability distributions either.

Despite this sounding like bad news, we can actually construct an alternative characterization of probability distributions on a new space, in which the distributions of interest, in our case $F(\mathbf{p},\mathbf{q})$, will all be members of a totally ordered subset.

To this end, we will rely on a characterization of the probability distribution of the random variable $R\sim F(\mathbf{p},\mathbf{q})$ via the sequence of its moments. The $k$-th such moment is, from (3) and by assumption 1.3, found to be

$$\begin{aligned}
[E(R^k)](\mathbf{p},\mathbf{q}) &=\int_{-\infty}^{\infty}x^k\,\mathrm{d}F(\mathbf{p},\mathbf{q})=\int_{-\infty}^{\infty}x^k\sum_{i,j}f_{ij}(x)\,C_{\mathbf{p},\mathbf{q}}(i,j)\,\mathrm{d}x\\
&=\sum_{i,j}C_{\mathbf{p},\mathbf{q}}(i,j)\int_{-\infty}^{\infty}x^k f_{ij}(x)\,\mathrm{d}x=\sum_{i,j}C_{\mathbf{p},\mathbf{q}}(i,j)\,E(R_{ij}^k), \qquad (5)
\end{aligned}$$

where the sum runs over $i=1,\ldots,n$ and $j=1,\ldots,m$, and $f_{ij}$ is the probability density of $R_{ij}$ for all $i,j$. Notice that the boundedness condition in assumption 1.3 assures existence and finiteness of all these moments. However, assumption 1.3 yields even more: since $R$ is a random variable within $[1,\infty)$ (nonnegativity) and has finite moments by the boundedness assumption, the distribution $F(\mathbf{p},\mathbf{q})$ is uniquely determined by its sequence of moments. This is made rigorous by the following lemma:
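For discrete losses, identity (5) can be verified mechanically in exact rational arithmetic; the $2\times 2$ scenario matrix and strategies below are hypothetical examples of ours:

```python
from fractions import Fraction as Fr

# discrete losses R_ij (values >= 1, compact support), as (value, prob) pairs
R = {(0, 0): [(1, Fr(1, 2)), (3, Fr(1, 2))],
     (0, 1): [(2, Fr(1, 1))],
     (1, 0): [(1, Fr(1, 4)), (2, Fr(3, 4))],
     (1, 1): [(4, Fr(1, 1))]}
p, q = [Fr(1, 3), Fr(2, 3)], [Fr(1, 2), Fr(1, 2)]
C = {(i, j): p[i] * q[j] for i in range(2) for j in range(2)}  # independence

def mixture_pmf():
    """Probability mass function of the overall outcome R from (3)."""
    pmf = {}
    for ij, pts in R.items():
        for v, pr in pts:
            pmf[v] = pmf.get(v, Fr(0)) + C[ij] * pr
    return pmf

def lhs(k):
    """k-th moment of the mixture distribution (left-hand side of (5))."""
    return sum(pr * v ** k for v, pr in mixture_pmf().items())

def rhs(k):
    """Weighted sum of component moments (right-hand side of (5))."""
    return sum(C[ij] * sum(pr * v ** k for v, pr in pts) for ij, pts in R.items())
```

With exact fractions, both sides of (5) coincide for every order $k$.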

###### Lemma 2.3

Let two random variables $X$ and $Y$ have moment generating functions that exist within a neighborhood $(-s_0, s_0)$ of zero. Assume that $\mathds{E}(X^k) = \mathds{E}(Y^k)$ for all $k \in \mathds{N}$. Then $X$ and $Y$ have the same distribution.

The proof is merely a collection of well-known facts about moment generating functions and the identity of their power-series expansions. For convenience and completeness, we nevertheless give the proof in (almost) full detail.

Proof (of lemma 2.3).  Let $Z$ be a general random variable. The finiteness of the moment-generating function $\mu_Z(s) = \mathds{E}(e^{sZ})$ within some open set $(-s_0, s_0)$ with $s_0 > 0$ yields $\mathds{E}(Z^k) = \mu_Z^{(k)}(0)$ via the $k$-th order derivative of $\mu_Z$ [2, Theorem 3.4.3]. Furthermore, if the moment generating function exists within $(-s_0, s_0)$, then it has a Taylor-series expansion (cf. [5, Sec.11.6.1])

$$\mu_Z(s) = \sum_{k=0}^{\infty} \frac{\mu_Z^{(k)}(0)}{k!}\, s^k, \qquad \forall s \in (-s_0, s_0). \qquad (6)$$

Identity of moments between $X$ and $Y$ (the lemma's hypothesis) thus implies the identity of the Taylor-series expansions of $\mu_X$ and $\mu_Y$, and in turn the identity $\mu_X = \mu_Y$ on $(-s_0, s_0)$. This equation finally implies that $X$ and $Y$ have the same distribution by the uniqueness theorem of moment-generating functions [2, Theorem 3.4.6].

Lemma 2.3 is the permission to characterize random variables by their moment sequence alone to uniquely pin down the probability distribution, i.e., we will hereafter write $m_R(k) := \mathds{E}(R^k)$, and use

$$(m_R(k))_{k \in \mathds{N}} \quad \text{to represent the random variable } R \sim F(p,q). \qquad (7)$$

Let $\mathds{R}^{\mathds{N}}$ denote the set of all real-valued sequences, on which we define a partial ordering by virtue of the above characterization as follows: let $F_1, F_2$ be two distributions defined by (3). As a first try, we could define a preference relation between the two distributions by comparing their moment sequences element-wise, i.e., we would prefer $F_1$ over $F_2$ if the respective moments satisfy $m_{R_1}(k) \le m_{R_2}(k)$ for all $k$, whenever $R_1 \sim F_1$ and $R_2 \sim F_2$.

It must be stressed that without extra conditions, this ordering is at most a partial one, since we could allow infinitely often alternating values for the moments in both sequences. To make the ordering total, we have to be specific on which indices matter and which do not. The result will be a standard ultrapower construction, so let $\mathcal{U}$ denote an arbitrary ultrafilter on $\mathds{N}$. Fortunately, the preference ordering obtained by comparing moments elementwise is ultimately independent of the particular ultrafilter in use. This is made precise in theorem 2.5, which is implied by a simple analysis of continuous distributions. We treat these first and discuss the discrete case later, as all of our upcoming findings remain valid in the discrete setting.

#### The Continuous Case:

###### Lemma 2.4

For any two distinct probability distributions $F_1 \ne F_2$ and associated random variables $R_1 \sim F_1$, $R_2 \sim F_2$ that satisfy assumption 1.3, there is a $K \in \mathds{N}$ so that either $\mathds{E}(R_1^k) \le \mathds{E}(R_2^k)$ for all $k \ge K$, or $\mathds{E}(R_2^k) \le \mathds{E}(R_1^k)$ for all $k \ge K$.

Proof. Let $f_1, f_2$ denote the densities of the distributions $F_1, F_2$. Fix the smallest $b$ so that $\Omega := [1, b]$ covers both the supports of $f_1$ and $f_2$. Consider the difference of the $k$-th moments, given by

$$\Delta(k) := \mathds{E}(R_1^k) - \mathds{E}(R_2^k) = \int_\Omega x^k f_1(x)\,dx - \int_\Omega x^k f_2(x)\,dx = \int_\Omega x^k (f_1 - f_2)(x)\,dx. \qquad (8)$$

Towards a lower bound to (8), we distinguish two cases:

1. If $f_1(x) > f_2(x)$ for all $x$ in a neighborhood $[b - \varepsilon, b]$ of the right endpoint, then, because $f_1, f_2$ are continuous, their difference attains a minimum $\delta > 0$ on the compact set $[b - \varepsilon, b]$. So, we can lower-bound (8) as $\Delta(k) \ge \delta \int_{b-\varepsilon}^{b} x^k\,dx - \lambda \int_{1}^{b-\varepsilon} x^k\,dx \to +\infty$, as $k \to \infty$, where $\lambda$ bounds $|f_1 - f_2|$ on $\Omega$.

2. Otherwise, we look at the right end of the interval $\Omega = [1, b]$, and define

$$a^* := \inf\{x \ge 1 : f_1(x) > f_2(x)\}.$$

Without loss of generality, we may assume $a^* < b$. To see this, note that if $f_1(a^*) > f_2(a^*)$, then the continuity of $f_1 - f_2$ implies $f_1 > f_2$ within a range $(a^*, a^* + \varepsilon)$ for some $\varepsilon > 0$, and $b^*$ is the supremum of all these values $a^* + \varepsilon$. Otherwise, if $f_1 = f_2$ on an entire interval $(a^*, a^* + \varepsilon)$ for some $\varepsilon > 0$, then $f_1 > f_2$ somewhere beyond $a^* + \varepsilon$ (the opposite of the previous case) implies the existence of some $\varepsilon' > \varepsilon$ so that $f_1 > f_2$ on $(a^* + \varepsilon, a^* + \varepsilon')$, and $b^*$ is the supremum of all these values $a^* + \varepsilon'$ (see figure 1 for an illustration). In case that $a^* = b$, we would have $f_1 \le f_2$ on $\Omega$, which is either trivial (as $\Delta(k) = 0$ for all $k$ if $f_1 = f_2$) or otherwise covered by the previous case.

In either situation, we can fix a compact interval $[a, b] \subseteq \Omega$ and two constants $\lambda_1, \lambda_2 > 0$ (which exist because $f_1, f_2$ are bounded, being continuous on the compact set $\Omega$), so that the function

$$\ell(k,x) := \begin{cases} -\lambda_1 x^k, & \text{if } 1 \le x < a;\\ \lambda_2 x^k, & \text{if } a \le x \le b, \end{cases}$$

lower-bounds the integrand $x^k (f_1 - f_2)(x)$ in (8) (see figure 1), and

$$\Delta(k) = \int_1^{b^*} x^k (f_1 - f_2)(x)\,dx \ge \int_1^b \ell(k,x)\,dx = -\lambda_1 \int_1^a x^k\,dx + \lambda_2 \int_a^b x^k\,dx \ge -\frac{a^{k+1}}{k+1}(\lambda_1 + \lambda_2) + \lambda_2 \frac{b^{k+1}}{k+1} \to +\infty,$$

as $k \to \infty$, due to $b > a$ and because $\lambda_1, \lambda_2$ are constants that depend only on $f_1$ and $f_2$.

In both cases, we conclude that, unless $f_1 = f_2$, we have $\Delta(k) > 0$ for all sufficiently large $k \ge K$, where $K$ is finite.
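The divergence argument can be illustrated numerically. In the sketch below, the two uniform distributions are made-up examples: $R_2 \sim U[1.5, 1.7]$ has the larger mean, but since the density of $R_1 \sim U[1, 2]$ is positive further to the right, the difference of moments eventually becomes and stays positive.

```python
def uniform_moment(a, b, k):
    """Closed-form k-th moment of the uniform distribution on [a, b]:
    E(R^k) = (b^{k+1} - a^{k+1}) / ((k + 1)(b - a))."""
    return (b**(k + 1) - a**(k + 1)) / ((k + 1) * (b - a))

# Delta(k) = E(R1^k) - E(R2^k) for R1 ~ U[1, 2] and R2 ~ U[1.5, 1.7]:
deltas = [uniform_moment(1.0, 2.0, k) - uniform_moment(1.5, 1.7, k)
          for k in range(1, 31)]

print(deltas[0])    # negative: R2 wins on the mean ...
print(deltas[-1])   # ... but positive and large for big k: the wider support dominates
```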

###### Theorem 2.5

Let $\mathcal{F}$ be the set of distributions that satisfy assumption 1.3. Assume the elements of $\mathcal{F}$ to be represented by hyperreal numbers in $\mathds{R}^{\mathds{N}} / \mathcal{U}$, where $\mathcal{U}$ is any free ultrafilter. There exists a total ordering on the set $\mathcal{F}$ that is independent of $\mathcal{U}$.

Proof. Let $F_1, F_2$ be two probability distributions, and let $R_1 \sim F_1$, $R_2 \sim F_2$. Lemma 2.4 assures the existence of some $K$ so that, w.l.o.g., we may define the ordering $F_1 \preceq F_2$ iff $m_{R_1}(k) \le m_{R_2}(k)$ whenever $k \ge K$. Let $A \subseteq \mathds{N}$ be the set of indices where $m_{R_1}(k) \le m_{R_2}(k)$; then the complement set $\mathds{N} \setminus A$ is finite (it has at most $K$ elements). Let $\mathcal{U}$ be an arbitrary free ultrafilter. Since $\mathds{N} \setminus A$ is finite, it cannot be contained in $\mathcal{U}$, as $\mathcal{U}$ is free. And since $\mathcal{U}$ is an ultrafilter, it must contain the complement of a set unless it contains the set itself. Hence, $A \in \mathcal{U}$, and the claim follows. The converse case is treated analogously.

Now, we can state our preference criterion on distributions on the quotient space $\mathds{R}^{\mathds{N}} / \mathcal{U}$, in which each probability distribution of interest is represented by its sequence of moments. Thanks to theorem 2.5, there is no need to explicitly construct the ultrafilter $\mathcal{U}$ in order to well-define best responses, since two distributions will compare in the same fashion under any admissible choice of $\mathcal{U}$.

###### Definition 2.6 (Preference Relation over Probability Distributions)

Let $R_1 \sim F_1$ and $R_2 \sim F_2$ be two random variables whose distributions satisfy assumption 1.3. We prefer $F_1$ over $F_2$ relative to an ultrafilter $\mathcal{U}$, written as

$$F_1 \preceq F_2 \;:\Longleftrightarrow\; \exists K \in \mathds{N} \text{ s.t. } \forall k \ge K: m_{R_1}(k) \le m_{R_2}(k). \qquad (9)$$

Strict preference of $F_1$ over $F_2$ is denoted as

$$F_1 \prec F_2 \;:\Longleftrightarrow\; \exists K \in \mathds{N} \text{ s.t. } \forall k \ge K: m_{R_1}(k) < m_{R_2}(k).$$

Theorem 2.5 establishes this definition to be compatible with (in the sense of being a continuation of) the ordering $\le$ on the hyperreals $^*\mathds{R}$, being defined as $[x] \le [y]$ iff $\{k : x_k \le y_k\} \in \mathcal{U}$, when $x, y$ are represented by the sequences $(x_k)_{k\in\mathds{N}}$ and $(y_k)_{k\in\mathds{N}}$.
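For distributions with closed-form moments, the eventual-dominance test behind (9) can be prototyped directly; the uniform examples and the finite search bound `k_max` in the following sketch are assumptions made purely for illustration (a finite scan can suggest, but not prove, eventual dominance).

```python
def uniform_moment(a, b, k):
    """Closed-form k-th moment of the uniform distribution on [a, b]."""
    return (b**(k + 1) - a**(k + 1)) / ((k + 1) * (b - a))

def dominance_threshold(m1, m2, k_max=200):
    """Smallest K such that m1(k) <= m2(k) for all K <= k <= k_max,
    or None if no such K exists within the scanned range."""
    K = None
    for k in range(1, k_max + 1):
        if m1(k) <= m2(k):
            K = K or k        # remember the start of the current run
        else:
            K = None          # dominance broken, restart the search
    return K

m1 = lambda k: uniform_moment(1.0, 1.5, k)   # R1 ~ U[1, 1.5]
m2 = lambda k: uniform_moment(1.0, 2.0, k)   # R2 ~ U[1, 2]

print(dominance_threshold(m1, m2))   # 1: F1 is preferred over F2
print(dominance_threshold(m2, m1))   # None: the reverse never holds
```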

By virtue of the $\preceq$-relation, we can define an equivalence between two distributions in the canonical way as

$$F_1 \equiv F_2 \;:\Longleftrightarrow\; (F_1 \preceq F_2) \wedge (F_2 \preceq F_1). \qquad (10)$$

Within the quotient space $\mathds{R}^{\mathds{N}} / \mathcal{U}$, we thus consider two distributions as identical if only finitely many of their moments mismatch. Observe that this does not imply the identity of the distribution functions themselves, unless actually all moments match.

The strict preference relation $\prec$ induces an order topology on $\mathcal{F}$, whose open sets are, for any two distributions $F_1, F_2 \in \mathcal{F}$,

$$(F_1, F_2) := \{F \in \mathcal{F} : F_1 \prec F \prec F_2\},$$

and the topology is denoted as $\tau$.

#### The Discrete Case:

In situations where the game’s payoffs are better modeled by discrete random variables, say if a nominal scale (“low”, “medium”, “high”) or a scoring scheme is used to express revenue, assumption 1.3 is too strong in the sense of prescribing a continuous density where the model density is actually discrete.

Assumption 1.3 covers discrete distributions that possess a density w.r.t. the counting measure. The line of argument used in the proof of lemma 2.4 remains intact without change, except for the obvious difference that the support is now a finite (and hence discrete) set. Likewise, all conclusions drawn from lemma 2.4, including theorem 2.5, as well as the definitions of the ordering and the topology, transfer without change.

### 2.3 Comparing Discrete and Continuous Distributions

The representation (7) of distributions by the sequence of their moments works even without assuming the density to be continuous. Therefore, it elegantly lets us compare distributions of mixed type, i.e., continuous vs. discrete distributions on a common basis.

It follows that we can – without any changes to the framework – compare discrete to continuous distributions, or any two distributions of the same type, in terms of the $\preceq$-, $\prec$- and $\equiv$-relations. This comparison is, obviously, only meaningful if the respective random variables live in the same (metric) space. For example, it would be meaningless to compare ordinal to numeric data.
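Such a mixed comparison can be sketched with made-up numbers: a discrete scoring distribution and a continuous payoff distribution compare through exactly the same kind of moment sequences (the score values, probabilities, and the uniform benchmark below are illustrative assumptions only).

```python
def discrete_moment(values, probs, k):
    """k-th moment of a discrete distribution: sum_i p_i * x_i^k."""
    return sum(p * v**k for v, p in zip(values, probs))

def uniform_moment(a, b, k):
    """Closed-form k-th moment of the uniform distribution on [a, b]."""
    return (b**(k + 1) - a**(k + 1)) / ((k + 1) * (b - a))

# Hypothetical scores "low" = 1 and "medium" = 1.5, each with probability 1/2,
# compared against a continuous payoff R2 ~ U[1, 2]:
md = lambda k: discrete_moment([1.0, 1.5], [0.5, 0.5], k)
mc = lambda k: uniform_moment(1.0, 2.0, k)

# The discrete distribution's moments stay below the continuous one's, so it
# is preferred in the sense of the moment-sequence ordering:
print(all(md(k) <= mc(k) for k in range(1, 201)))   # True
```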

### 2.4 Comparing Deterministic to Random

On certain occasions, the consequence of an action may have perfectly foreseeable effects, such as fines or similar. Such deterministic outcomes can be modeled as degenerate distributions (point- or Dirac-masses). Note that the canonic embedding of the reals within the hyperreals represents a number $x$ by the constant sequence $(x, x, x, \ldots)$; picking up this idea would be critically flawed in our setting, as any such constant sequence would be preferred over any probability distribution (whose moment sequence diverges and thus overshoots the constant inevitably and ultimately). Degenerate distributions are singular and thus violate assumption 1.3, since there is no density function associated with them, unless one is willing to resort to generalized functions, which we do not do in this report. Nevertheless, it is possible to work out their representation in terms of moment sequences. If $R$ is a random variable that deterministically takes on the constant value $c$ all the time, then the respective moment sequence has the elements $m_R(k) = c^k$ for all $k$. Given another non-degenerate distribution with density function $f$ supported on $[a, b]$, we can lower- and upper-bound the moments of the respective random variable by exponential functions in $k$, which can straightforwardly be $\preceq$-, $\prec$- or $\equiv$-compared to the representative $(c^k)_{k\in\mathds{N}}$ of the deterministic outcome $c$. Algorithmic details will follow in part two of this research report.
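A numerical sketch of this comparison, with made-up numbers: a point mass at $c$ has moment sequence $c^k$, which can be set against the moments of a continuous payoff such as $U[1,2]$ (the concrete values 1.8 and 2.5 and the scan bound are illustrative assumptions, not part of the framework).

```python
def uniform_moment(a, b, k):
    """Closed-form k-th moment of the uniform distribution on [a, b]."""
    return (b**(k + 1) - a**(k + 1)) / ((k + 1) * (b - a))

def dominated_from(m1, m2, k_max=500):
    """Smallest K with m1(k) < m2(k) for all K <= k <= k_max, else None."""
    K = None
    for k in range(1, k_max + 1):
        if m1(k) < m2(k):
            K = K or k
        else:
            K = None
    return K

mc = lambda k: uniform_moment(1.0, 2.0, k)

# A certain outcome of 1.8 lies inside the support [1, 2]: its moments 1.8^k
# are eventually overtaken by those of U[1, 2], so the point mass is preferred.
print(dominated_from(lambda k: 1.8**k, mc) is not None)   # True

# A certain outcome of 2.5 exceeds the whole support: it is never preferred.
print(dominated_from(lambda k: 2.5**k, mc) is None)       # True
```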

### 2.5 Extensions: Relaxing Assumption 1.3

Risk management is often required to handle or avoid extreme (catastrophic) events. The respective statistical models are distributions with so-called “heavy”, “long” or “fat” tails (exact definitions and distribution models will follow in part two of this report). Extreme-value distributions such as the Gumbel-distribution, or also the Cauchy-distribution (which is not an extreme value model), are two natural examples that fall into the class of distributions assigning unusually high likelihood to large outcomes (which may be considered as catastrophic consequences of an action). In any case, our assumption 1.3 rules out such distributions by requiring compact support. Even worse, the $\preceq$-relation based on the representation of a distribution by the sequence of its moments cannot be extended to cover distributions with heavy tails, as those typically do not have finite moments or moment-generating functions. Nevertheless, such distributions are important tools in risk management.

Things are, however, not drastically restricted by assumption 1.3, for at least two reasons: First, compactness of the support is not necessary for all moments to exist; the Gaussian distribution, for instance, has moments of all orders yet is supported on the entire real line (thus violating even two of the three conditions of assumption 1.3). Still, it is characterized entirely by its first two moments, and thus can easily be compared in terms of the $\preceq$-relation.

Second, and more importantly, any distribution with infinite support can be approximated by a truncated distribution. Given a random variable $X$ with distribution function $F$, the truncated distribution is the distribution of $X$ conditional on falling into a finite range $[a, b]$, i.e., the truncated distribution function gives the conditional likelihood

$$\hat{F}(x) = \Pr(X \le x \mid a \le X \le b).$$

Provided that $X$ has a density function $f$, the truncated density function is

$$\hat{f}(x) = \begin{cases} \dfrac{f(x)}{F(b) - F(a)}, & a \le x \le b;\\[1ex] 0, & \text{otherwise.} \end{cases}$$

In other words, we simply crop the density outside the interval $[a, b]$ and re-scale the resulting function to become a probability density again.
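The crop-and-rescale step can be sketched in a few lines. Here the exponential distribution (rate 1) serves as a stand-in for a distribution with infinite support, and the truncation bounds are arbitrary choices for illustration.

```python
import math

def F(x):
    """Exponential CDF (rate 1), our illustrative stand-in."""
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

def f(x):
    """Exponential density (rate 1)."""
    return math.exp(-x) if x >= 0 else 0.0

def truncated_density(x, a, b):
    """Density of X conditional on a <= X <= b: crop outside [a, b], re-scale."""
    if a <= x <= b:
        return f(x) / (F(b) - F(a))
    return 0.0

# Numerical check: the truncated density integrates to 1 over [a, b]
# (midpoint rule on a fine grid).
a, b, n = 0.0, 5.0, 100_000
h = (b - a) / n
total = sum(truncated_density(a + (i + 0.5) * h, a, b) for i in range(n)) * h
print(abs(total - 1.0) < 1e-6)   # True
```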

Since every distribution function is non-decreasing and satisfies $F(x) \to 1$ as $x \to \infty$, any choice of $\varepsilon > 0$ admits a value $b$ such that $F(b) > 1 - \varepsilon$. Moreover, since our random variables are all non-negative and continuously distributed, we have $F(0) = 0$, as $F$ is right-continuous. It follows that the truncated density for variables of interest in our setting simplifies to $\hat{f}(x) = f(x)/F(b)$. Now, let us compare a distribution to its truncated version in terms of the probabilities that we would compute:

$$\left|F(x) - \hat{F}(x)\right| = \left|\int_0^x f(t)\,dt - \int_0^x \frac{f(t)}{F(b)}\,dt\right| = \left|\int_0^x f(t)\underbrace{\left(1 - \frac{1}{F(b)}\right)}_{|\,\cdot\,| < \varepsilon}\,dt\right| < \varepsilon \int_0^\infty f(t)\,dt = \varepsilon,$$

for sufficiently large $b$, which depends on the chosen $\varepsilon$ that determines the quality of the approximation. Conversely, we can always find a truncated distribution $\hat{F}$ that approximates $F$ up to an arbitrary precision $\varepsilon > 0$. This shows that restricting ourselves to distributions with compact support, i.e., adopting assumption 1.3, causes no more than a numerical error that can be made as small as we wish.

More interestingly, we could attempt to play the same trick as before, and characterize a distribution with fat, heavy or long tails by a sequence of approximations to it, arising from better and better precisions $\varepsilon \to 0$. In that sense, we could hope to compare approximations rather than the true densities in an attempt to extend the preference and equivalence relations $\preceq$ and $\equiv$ to distributions with fat, heavy or long tails.

Unfortunately, such hope is misplaced, as a distribution is not uniquely characterized by a general sequence of approximations (i.e., we cannot formulate an equivalent of lemma 2.3), and the outcome of a comparison of approximations is not invariant to how the approximations are chosen (i.e., there is also no analogue of lemma 2.4). To see the latter, take the quantile function $F^{-1}$ of a distribution $F$, and consider the tail quantiles $\overline{F}^{-1}(\alpha) := F^{-1}(1 - \alpha)$. Pick any sequence $(\alpha_n)_{n\in\mathds{N}}$ with $\alpha_n \to 0$. Since the support is infinite, the tail quantile sequence behaves like $\overline{F}^{-1}(\alpha_n) \to \infty$, where the limit is independent of the particular sequence $(\alpha_n)$; only the speed of divergence differs between distinct sequences.

Now, let two distributions $F_1, F_2$ with infinite support be given. Fix two sequences $(\alpha_n)$ and $(\omega_n)$, both vanishing as $n \to \infty$, and set

$$a_n := \overline{F}_1^{-1}(\alpha_n) \le b_n := \overline{F}_2^{-1}(\omega_n). \qquad (11)$$

Let us approximate $F_1$ by a sequence of truncated distributions with supports $[0, a_n]$, and let another sequence approximate $F_2$ on the supports $[0, b_n]$. Since $a_n \le b_n$ for all $n$, the proof of lemma 2.4 then implies that the approximation with support $[0, a_n]$ is always strictly preferable to the approximation with support $[0, b_n]$, thus $F_1 \prec F_2$. However, by replacing the “$\le$” by a “$\ge$” in (11), we can construct approximations to $F_1, F_2$ whose supports exceed one another in the reverse way, so that the approximations would always satisfy $F_2 \prec F_1$. It follows that sequences of approximations cannot be used to unambiguously compare distributions with infinite support, unless we impose some constraints on the tails of the distributions and the approximations. The next lemma assumes this situation to simply not occur, which allows us to give a sufficient condition to unambiguously extend strict preference in the way we wish.

###### Lemma 2.7

Let $F_1, F_2$ be two distributions supported on the entire nonnegative real half-line with continuous densities $f_1, f_2$. Let $(b_n)_{n\in\mathds{N}}$ be an arbitrary sequence with $b_n \to \infty$ as $n \to \infty$, and let $\hat{F}_{i,n}$ for $i = 1, 2$ be the truncated distribution supported on $[0, b_n]$.

If there is a constant and a value such that for all