Belief likelihood function for generalised logistic regression

Fabio Cuzzolin
School of Engineering, Computing and Mathematics
Oxford Brookes University
Oxford, UK
Abstract

The notion of belief likelihood function of repeated trials is introduced, for the case in which the uncertainty for individual trials is encoded by a belief measure (a finite random set). This generalises the traditional likelihood function, and provides a natural setting for belief inference from statistical data. Factorisation results are proven for the cases in which conjunctive or disjunctive combination is employed, leading to analytical expressions for the lower and upper likelihoods of ‘sharp’ samples in the case of Bernoulli trials, and to the formulation of a generalised logistic regression framework.

1 Introduction

Logistic regression [4] is a popular statistical method for modelling data in which one or more independent observed variables determine an outcome, represented by a binary variable. The framework can also be extended to the multinomial case, and is widely used in various fields, including machine learning, medical diagnosis, and social sciences, to cite a few. Despite its successes, the method has serious limitations. In particular, it has been shown to consistently and sharply underestimate the probability of ‘rare’ events [19]. The term ‘rare event’ [25] denotes cases in which the training data are of insufficient quality, in the sense that they do not represent well enough the underlying distribution. As a result, scientists are forced to infer probability distributions using information captured in ‘normal’ times (e.g. while a nuclear power plant is working nominally), whereas these distributions are later used to extrapolate results at the ‘tail’ of the curve.

Although corrections to logistic regression have been proposed [19], the root cause of the problem, in our view, is that our very models of uncertainty are themselves affected by uncertainty: a phenomenon often called ‘Knightian’ uncertainty. The latter can be explicitly modelled by considering convex sets of probability distributions, or ‘credal sets’ [24, 23]. Random sets [26, 27, 29, 15], in particular, are a sub-class of credal sets induced by probability distributions on the collection of all subsets of the sample space. In the finite case random sets are often called belief functions, a term introduced by Glenn Shafer [28] from a subjective probability perspective.

As we show here, the logistic regression framework can indeed be generalised to the case of belief functions, which themselves generalise classical discrete probability measures. Given a sample space $\mathbb{X}$, the traditional likelihood function is equal to the conditional probability of the data given a parameter $\theta \in \Theta$, i.e., a family of probability distribution functions (PDFs) over $\mathbb{X}$ parameterised by $\theta$: $L(\theta|x) \doteq p(x|\theta)$, $\theta \in \Theta$. As originally proposed by Shafer and Wasserman [28, 32, 33], belief functions can indeed be built from traditional likelihood functions. However, as we argue here, one can directly define a belief likelihood function, mapping a sample observation to a real number, as a natural set-valued generalisation of the conventional likelihood. It is natural to define such a belief likelihood function as a family of belief functions on $\mathbb{X}$, $Bel_{\mathbb{X}}(\cdot|\theta)$, parameterised by $\theta \in \Theta$. As the latter take values on sets of outcomes, $A \subset \mathbb{X}$, of which singleton outcomes are mere special cases, they provide a natural setting for computing likelihoods of set-valued observations, in accordance with the random set philosophy.

When applied to samples generated by series of independent trials, under a generalisation of stochastic independence, belief likelihoods factorise into simple products. The resulting lower and upper likelihoods can be easily computed for series of Bernoulli trials, and allow us to formulate a generalised logistic regression framework, in which the mass values of individual trials are constrained to follow a logistic dependence on scalar parameters. The values of the parameters which optimise the lower and upper likelihoods induce a pair of ‘lower’ and ‘upper’ belief functions on the parameter space, whose interval effectively encodes the uncertainty associated with the amount of data at our disposal. Every new observation, possibly in areas of the sample space not previously explored, is mapped to a pair of lower and upper logistic belief functions, which together provide lower and upper estimates for the belief values of each event.

1.1 Contributions

The contributions of the paper are thus as follows:

(1) a belief likelihood function for repeated trials is defined, whenever the uncertainty on individual trials is assumed to be encoded by a belief measure;

(2) elegant factorisation properties are proven for events that are Cartesian products, whenever belief measures are combined by conjunctive rule, leading to the notions of lower and upper likelihoods;

(3) factorisation results are also provided in the case in which the dual, disjunctive combination is used to compute belief and plausibility likelihoods;

(4) analytical expressions of lower and upper likelihoods are provided for the case of Bernoulli trials;

(5) finally, a generalised logistic regression based on lower and upper likelihoods is formulated and analysed, as an alternative inference mechanism to generate belief functions from statistical data.

1.2 Paper outline

After reviewing in Section 2 the logistic regression framework, we recall in Section 3 the necessary notions of the theory of belief functions. In Section 4 the belief likelihood function of repeated trials is defined. In Section 5, the belief likelihood of a series of binary trials is analysed in the conjunctive case. Factorisation results are shown which reduce upper and lower likelihoods of ‘sharp’ samples to products of belief values of individual binary observations, and can be generalised to arbitrary Cartesian products of focal elements. In Section 6 an analysis of the belief likelihood function in the disjunctive case is conducted. General factorisation results holding for series of observations from arbitrary sample spaces are illustrated in Section 7, while analytical expressions for the Bernoulli case are given in Section 8. Finally, a generalised logistic regression framework is outlined (Section 9) in which the masses of the two outcomes are constrained to have a logistic dependency, and dual optimisation problems lead to a pair of lower and upper estimates for the belief measure of the outcomes. Section 10 concludes the paper and points at future work.

2 Logistic regression

Logistic regression allows us, given a sample $Y = \{Y_1,\ldots,Y_n\}$, $X = \{x_1,\ldots,x_n\}$, where $Y_i \in \{0,1\}$ is a binary outcome at time $i$ and $x_i$ is the corresponding observed measurement, to learn the parameters of a conditional probability relation between the two, of the form:

 $$P(Y=1|x) = \frac{1}{1+e^{-(\beta_0+\beta_1 x)}},$$ (1)

where $\beta_0$ and $\beta_1$ are two scalar parameters. Given a new observation $x$, (1) delivers the probability of a positive outcome $Y=1$. Logistic regression generalises deterministic linear regression, as it is a function of the linear combination $\beta_0+\beta_1 x$. The trials are assumed independent but not equally distributed, for $\pi_i = P(Y_i=1|x_i)$ varies with the time instant $i$ of collection. The two scalar parameters in (1) are estimated by maximum likelihood of the sample. After denoting by

 $$\pi_i = P(Y_i=1|x_i) = \frac{1}{1+e^{-(\beta_0+\beta_1 x_i)}}, \qquad 1-\pi_i = P(Y_i=0|x_i) = \frac{e^{-(\beta_0+\beta_1 x_i)}}{1+e^{-(\beta_0+\beta_1 x_i)}}$$ (2)

the conditional probabilities of the two outcomes, the likelihood of the sample can be expressed as: $L(\beta|Y) = \prod_{i=1}^n \pi_i^{Y_i}(1-\pi_i)^{1-Y_i}$, where $Y_i \in \{0,1\}$ and $\pi_i$ is a function of $\beta = [\beta_0,\beta_1]$. Maximising $L(\beta|Y)$ yields a conditional PDF $P(Y|x)$.
Unfortunately, logistic regression shows clear limitations when the number of samples is insufficient or when there are too few positive outcomes (1s) [19]. Moreover, inference by logistic regression tends to underestimate the probability of a positive outcome [19].
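
As an illustration of the maximum likelihood step, the following is a minimal sketch in Python that fits $\beta_0$ and $\beta_1$ in (1) by plain gradient ascent on the log-likelihood of the sample. The toy data, the learning rate and the number of steps are assumptions made for the example only, and gradient ascent is just one of several possible optimisers.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), as in Equation (1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(beta0, beta1, xs, ys):
    """Log of prod_i pi_i^{Y_i} (1 - pi_i)^{1 - Y_i}."""
    total = 0.0
    for x, y in zip(xs, ys):
        pi = sigmoid(beta0 + beta1 * x)
        total += y * math.log(pi) + (1 - y) * math.log(1.0 - pi)
    return total

def fit_logistic(xs, ys, lr=0.05, steps=5000):
    """Gradient ascent on the log-likelihood (hypothetical settings)."""
    beta0, beta1 = 0.0, 0.0
    for _ in range(steps):
        # Gradient components: sum_i (Y_i - pi_i) and sum_i (Y_i - pi_i) x_i.
        residuals = [y - sigmoid(beta0 + beta1 * x) for x, y in zip(xs, ys)]
        beta0 += lr * sum(residuals)
        beta1 += lr * sum(r * x for r, x in zip(residuals, xs))
    return beta0, beta1

# Toy, non-separable sample (illustrative values only).
xs = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

beta0, beta1 = fit_logistic(xs, ys)
print("beta0 =", beta0, "beta1 =", beta1)
print("P(Y=1 | x=1.5) =", sigmoid(beta0 + beta1 * 1.5))
```

With so few positive outcomes in a sample, the fitted curve can be quite sensitive to individual observations, which is the limitation discussed above.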

3 Belief functions

3.1 Belief and plausibility measures

Definition 1.

A basic probability assignment (BPA) [1] over a finite domain $\Theta$ is a set function [10, 11] $m: 2^\Theta \rightarrow [0,1]$ defined on the collection $2^\Theta$ of all subsets of $\Theta$ s.t.:

 $$m(\emptyset) = 0, \qquad \sum_{A \subset \Theta} m(A) = 1.$$

The quantity $m(A)$ is called the basic probability number or ‘mass’ [22, 21] assigned to $A$. The elements of the power set $2^\Theta$ associated with non-zero values of $m$ are called the focal elements of $m$.

Definition 2.

The belief function (BF) associated with a basic probability assignment $m: 2^\Theta \rightarrow [0,1]$ is the set function $Bel: 2^\Theta \rightarrow [0,1]$ defined as:

 $$Bel(A) = \sum_{B \subseteq A} m(B).$$ (3)

The domain $\Theta$ on which a belief function is defined is usually interpreted as the set of possible answers to a given problem, exactly one of which is the correct one. For each subset (‘event’) $A \subset \Theta$ the quantity $Bel(A)$ takes on the meaning of degree of belief that the truth lies in $A$, and represents the total belief committed to a set of possible outcomes $A$ by the available evidence $m$.

Another mathematical expression of the evidence generating a belief function $Bel$ is the upper probability or plausibility of an event $A$: $Pl(A) \doteq 1 - Bel(A^c)$, as opposed to its lower probability $Bel(A)$ [6]. The corresponding plausibility function $Pl: 2^\Theta \rightarrow [0,1]$ conveys the same information as $Bel$, and can be expressed as:

 $$Pl(A) = \sum_{B \cap A \neq \emptyset} m(B) \geq Bel(A).$$
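
The two definitions can be sketched in a few lines of Python. Here a mass function is represented, as an assumption of this example, by a dictionary mapping `frozenset` focal elements to their masses; the frame and the mass values are illustrative only.

```python
def bel(m, A):
    """Belief of A: total mass of the focal elements contained in A (Equation (3))."""
    return sum(v for B, v in m.items() if B <= A)

def pl(m, A):
    """Plausibility of A: total mass of the focal elements intersecting A."""
    return sum(v for B, v in m.items() if B & A)

# Illustrative frame and mass assignment (hypothetical values).
theta = frozenset({"a", "b", "c"})
m = {
    frozenset({"a"}): 0.5,
    frozenset({"a", "b"}): 0.3,
    theta: 0.2,
}

A = frozenset({"a", "b"})
print("Bel(A) =", bel(m, A))
print("Pl(A)  =", pl(m, A))
# Duality between the two measures: Pl(A) = 1 - Bel(complement of A).
print("1 - Bel(A^c) =", 1.0 - bel(m, theta - A))
```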

3.2 Evidence combination

The issue of combining the belief function representing our current knowledge state with a new one encoding the new evidence is central in belief theory. After an initial proposal by Dempster, several other aggregation operators have been proposed, based on different assumptions on the nature of the sources of evidence to combine.

Definition 3.

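The orthogonal sum or Dempster’s combination $Bel_1 \oplus Bel_2$ of two belief functions $Bel_1$, $Bel_2$ defined on the same domain $\Theta$ is the unique BF on $\Theta$ with as focal elements all the non-empty intersections of focal elements of $Bel_1$ and $Bel_2$, and basic probability assignment:

 $$m_\oplus(A) = \frac{m_\cap(A)}{1 - m_\cap(\emptyset)},$$ (4)

where $m_i$ denotes the BPA of the input BF $Bel_i$, and:

 $$m_\cap(A) = \sum_{B \cap C = A} m_1(B)\, m_2(C).$$

Rather than normalising (as in (4)), Smets’ conjunctive rule leaves the conflicting mass with the empty set:

 $$m_{\textcircled{$\scriptstyle\cap$}}(A) = m_\cap(A), \qquad \emptyset \subseteq A \subseteq \Theta,$$ (5)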

and is thus applicable to ‘unnormalised’ beliefs [31].

In Dempster’s rule, consensus between two sources is expressed by the intersection of the supported events. When the union is taken to express consensus we obtain the disjunctive rule of combination [20, 34]:

 $$m_{\textcircled{$\scriptstyle\cup$}}(A) = \sum_{B \cup C = A} m_1(B)\, m_2(C),$$ (6)

which yields more cautious inferences than conjunctive rules, by producing belief functions that are less ‘committed’, i.e., have larger focal sets. Under disjunctive combination: $Bel_1 \textcircled{$\scriptstyle\cup$} Bel_2 (A) = Bel_1(A) \cdot Bel_2(A)$, input belief values are simply multiplied.
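
The three rules can be compared on a small example. The following Python sketch assumes mass functions stored as dictionaries from `frozenset` focal elements to masses; the function names and the numerical values are illustrative and not part of the original formulation.

```python
from itertools import product

def conjunctive(m1, m2):
    """Smets' conjunctive rule (5): conflicting mass stays on the empty set."""
    out = {}
    for (B, v1), (C, v2) in product(m1.items(), m2.items()):
        out[B & C] = out.get(B & C, 0.0) + v1 * v2
    return out

def dempster(m1, m2):
    """Dempster's rule (4): conjunctive rule followed by normalisation."""
    raw = conjunctive(m1, m2)
    conflict = raw.pop(frozenset(), 0.0)
    if 1.0 - conflict <= 0.0:
        raise ValueError("totally conflicting sources cannot be combined")
    return {A: v / (1.0 - conflict) for A, v in raw.items()}

def disjunctive(m1, m2):
    """Disjunctive rule (6): mass products are assigned to unions."""
    out = {}
    for (B, v1), (C, v2) in product(m1.items(), m2.items()):
        out[B | C] = out.get(B | C, 0.0) + v1 * v2
    return out

def bel(m, A):
    """Belief of A: total mass of the focal elements contained in A."""
    return sum(v for B, v in m.items() if B <= A)

# Hypothetical inputs on the frame {a, b, c}.
theta = frozenset({"a", "b", "c"})
m1 = {frozenset({"a"}): 0.6, frozenset({"a", "b"}): 0.4}
m2 = {frozenset({"b"}): 0.5, theta: 0.5}

m_conj = conjunctive(m1, m2)
m_demp = dempster(m1, m2)
m_disj = disjunctive(m1, m2)

A = frozenset({"a", "b"})
print("conflict mass:", m_conj.get(frozenset(), 0.0))
print("Bel under disjunctive rule:", bel(m_disj, A))
print("product of input beliefs:  ", bel(m1, A) * bel(m2, A))
```

In this example the disjunctive result has only $\{a,b\}$ and $\Theta$ as focal elements, which illustrates the sense in which it is less committed than the conjunctive one.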

3.3 Conditioning

Belief functions can also be conditioned, rather than combined, whenever we are presented hard evidence of the form ‘$A$ is true’ [3, 12, 18, 14, 9, 35, 17].

In particular, Dempster’s combination naturally induces a conditioning operator. Given a conditioning event $A \subset \Theta$, the ‘logical’ (or ‘categorical’, in Smets’ terminology) belief function $Bel_A$ such that $m(A) = 1$ is combined via Dempster’s rule with the a-priori belief function $Bel$. The resulting BF $Bel \oplus Bel_A$ is the conditional belief function given $A$ à la Dempster, denoted by $Bel_\oplus(\cdot|A)$.

3.4 Multivariate analysis

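In many applications, we need to express uncertain information about a number of distinct variables (e.g., $X$ and $Y$) taking values in different domains ($\Theta_X$ and $\Theta_Y$, respectively). The reasoning process needs then to take place in the Cartesian product of the domains associated with each individual variable.

Let then $\Theta_X$ and $\Theta_Y$ be two sample spaces associated with two distinct variables, and let $m^{XY}$ be a mass function on $\Theta_{XY} = \Theta_X \times \Theta_Y$. The latter can be expressed in the coarser domain $\Theta_X$ by transferring each mass $m^{XY}(A)$ to the projection $A \downarrow \Theta_X$ of $A$ on $\Theta_X$. We obtain a marginal mass function on $\Theta_X$, denoted by:

 $$m^{XY}_{\downarrow X}(B) \doteq \sum_{\{A \subseteq \Theta_{XY},\ A \downarrow \Theta_X = B\}} m^{XY}(A), \qquad \forall B \subseteq \Theta_X.$$

Conversely, a mass function $m_X$ on $\Theta_X$ can be expressed in $\Theta_X \times \Theta_Y$ by transferring each mass $m_X(B)$ to the cylindrical extension $B \times \Theta_Y$ of $B$. The vacuous extension of $m_X$ onto $\Theta_X \times \Theta_Y$ will then be:

 $$m_X^{\uparrow XY}(A) \doteq \begin{cases} m_X(B) & \text{if } A = B \times \Theta_Y, \\ 0 & \text{otherwise.} \end{cases}$$ (7)

The associated BF is denoted by $Bel_X^{\uparrow XY}$.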
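
Marginalisation and vacuous extension can be sketched as follows. In this Python example, which is only one possible encoding, subsets of $\Theta_X \times \Theta_Y$ are represented as `frozenset`s of `(x, y)` pairs; the frames and masses are hypothetical.

```python
def marginal_on_x(m_xy):
    """Marginal on Theta_X: each mass is moved to the projection of its focal element."""
    out = {}
    for A, v in m_xy.items():
        projection = frozenset(x for x, _ in A)
        out[projection] = out.get(projection, 0.0) + v
    return out

def vacuous_extension(m_x, theta_y):
    """Vacuous extension (7): each mass is moved to the cylinder B x Theta_Y."""
    return {
        frozenset((x, y) for x in B for y in theta_y): v
        for B, v in m_x.items()
    }

# Hypothetical frames and mass function on Theta_X.
theta_y = frozenset({"u", "v"})
m_x = {frozenset({"a"}): 0.7, frozenset({"a", "b"}): 0.3}

m_ext = vacuous_extension(m_x, theta_y)
m_back = marginal_on_x(m_ext)

for A, v in m_ext.items():
    print(sorted(A), v)
print("marginalising the extension recovers m_x:", m_back == m_x)
```

Marginalising a vacuous extension returns the original mass function, since the projection of $B \times \Theta_Y$ on $\Theta_X$ is $B$ itself.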

4 Belief likelihood of repeated trials

Let $Bel_{X_i}(A|\theta)$, for $i = 1,\ldots,n$, be a parameterised family of belief functions on $X_i$, the space of quantities that can be observed at time $i$, depending on a parameter $\theta \in \Theta$. A series of repeated trials then assumes values in $X_1 \times \cdots \times X_n$, whose elements are tuples of the form $\vec{x} = (x_1,\ldots,x_n) \in X_1 \times \cdots \times X_n$. We call such tuples ‘sharp’ samples, as opposed to arbitrary subsets $A \subset X_1 \times \cdots \times X_n$ of the space of trials. Note that we are not assuming the trials to be equally distributed at this stage, nor do we assume that they come from the same sample space.

Definition 4.

The belief likelihood function of a series of repeated trials is defined as:

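 $$Bel_{X_1 \times \cdots \times X_n}(A|\theta) \doteq Bel_{X_1}^{\uparrow \times_i X_i} \odot \cdots \odot Bel_{X_n}^{\uparrow \times_i X_i}(A|\theta),$$ (8)

where $Bel_{X_j}^{\uparrow \times_i X_i}$ is the vacuous extension (7) of $Bel_{X_j}$ to the Cartesian product $X_1 \times \cdots \times X_n$ where the observed tuples live, and $\odot$ is an arbitrary combination rule.

In particular, when the subset $A$ reduces to a sharp sample, $A = \{\vec{x}\}$, we can define the following generalisations of the notion of likelihood.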

Definition 5.

We call the quantities

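 $$\underline{L}(\vec{x}) \doteq Bel_{X_1 \times \cdots \times X_n}(\{(x_1,\ldots,x_n)\}|\theta), \qquad \overline{L}(\vec{x}) \doteq Pl_{X_1 \times \cdots \times X_n}(\{(x_1,\ldots,x_n)\}|\theta)$$ (9)

lower likelihood and upper likelihood, respectively, of $\vec{x} = (x_1,\ldots,x_n)$.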

5 Binary trials: the conjunctive case

Belief likelihoods factorise into simple products whenever conjunctive combination is employed (as a generalisation of classical stochastic independence) in Definition 4, and trials with binary outcomes are considered.

5.1 Focal elements of the belief likelihood

Let us first analyse the case $n = 2$. We seek the Dempster sum $Bel_{X_1} \oplus Bel_{X_2}$, where $X_1 = X_2 = \{T, F\}$.
Figure 1 is a diagram of all the intersections of focal elements of the two input BFs on $X_1 \times X_2$.

There are $9 = 3^2$ distinct, non-empty intersections, which correspond to the focal elements of $Bel_{X_1} \oplus Bel_{X_2}$. According to Equation (4), the mass of focal element $A_1 \times A_2$, $A_1 \subseteq X_1$, $A_2 \subseteq X_2$, is then:

 $$m_{Bel_{X_1} \oplus Bel_{X_2}}(A_1 \times A_2) = m_{X_1}(A_1) \cdot m_{X_2}(A_2).$$ (10)

Note that the result holds when using the conjunctive rule (5) as well, for none of the intersections is empty, hence no normalisation is required. Nothing is assumed about the mass assignment of $Bel_{X_1}$ and $Bel_{X_2}$.

We can now prove the following Lemma.

Lemma 1.

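For any $n \in \mathbb{Z}$ the belief function $Bel_{X_1} \oplus \cdots \oplus Bel_{X_n}$, where $X_i = X = \{T, F\}$, has $3^n$ focal elements, namely all possible Cartesian products $A_1 \times \cdots \times A_n$ of $n$ non-empty subsets $A_i$ of $X$, with BPA:

 $$m_{Bel_{X_1} \oplus \cdots \oplus Bel_{X_n}}(A_1 \times \cdots \times A_n) = \prod_{i=1}^n m_{X_i}(A_i).$$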
Proof.

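The proof is by induction. The thesis was shown to be true for $n = 2$ in Equation (10). In the induction step, we assume that the thesis is true for $n$, and prove it for $n+1$. If $Bel_{X_1} \oplus \cdots \oplus Bel_{X_n}$, defined on $X_1 \times \cdots \times X_n$, has as focal elements the $n$-products $A_1 \times \cdots \times A_n$ with $A_i \in \{\{T\}, \{F\}, X\}$ for all $i$, its vacuous extension to $X_1 \times \cdots \times X_n \times X_{n+1}$ will have as focal elements the $(n+1)$-products of the form: $A_1 \times \cdots \times A_n \times X_{n+1}$, with $A_i \in \{\{T\}, \{F\}, X\}$ for all $i$.

The belief function $Bel_{X_{n+1}}$ is defined on $X_{n+1} = X$, with three focal elements: $\{T\}$, $\{F\}$ and $X = \{T, F\}$. Its vacuous extension to $X_1 \times \cdots \times X_n \times X_{n+1}$ thus has the following three focal elements: $X_1 \times \cdots \times X_n \times \{T\}$, $X_1 \times \cdots \times X_n \times \{F\}$ and $X_1 \times \cdots \times X_n \times X_{n+1}$.
When computing $(Bel_{X_1} \oplus \cdots \oplus Bel_{X_n}) \oplus Bel_{X_{n+1}}$ on the common refinement $X_1 \times \cdots \times X_n \times X_{n+1}$ we need to compute the intersection of their focal elements, namely:

 $$(A_1 \times \cdots \times A_n \times X_{n+1}) \cap (X_1 \times \cdots \times X_n \times A_{n+1}) = A_1 \times \cdots \times A_n \times A_{n+1}$$

for all $A_i \subseteq X_i$, $i = 1,\ldots,n+1$. All such intersections are distinct for distinct focal elements of the two belief functions to combine, and there are no empty intersections. By Dempster’s rule (4) their mass is equal to the product of the original masses, i.e.:

 $$m_{Bel_{X_1} \oplus \cdots \oplus Bel_{X_{n+1}}}(A_1 \times \cdots \times A_n \times A_{n+1}) = m_{Bel_{X_1} \oplus \cdots \oplus Bel_{X_n}}(A_1 \times \cdots \times A_n) \cdot m_{Bel_{X_{n+1}}}(A_{n+1}).$$

Since we assumed that the factorisation holds for $n$, the thesis easily follows. ∎

As no normalisation is involved in the combination $Bel_{X_1} \oplus \cdots \oplus Bel_{X_n}$, Dempster’s rule coincides with the conjunctive rule and Lemma 1 holds for $\textcircled{$\scriptstyle\cap$}$ as well.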
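
Lemma 1 can be checked numerically on a small case. The sketch below, written in Python under the assumption that subsets of $X_1 \times \cdots \times X_n$ are encoded as `frozenset`s of $n$-tuples, combines the vacuous extensions of three binary mass functions by brute force and compares the result with the product formula. The mass values are arbitrary illustrative choices.

```python
from functools import reduce
from itertools import product
from math import prod

X = ("T", "F")
T, F, TF = frozenset("T"), frozenset("F"), frozenset(X)

def vacuous_extension(m_i, i, n):
    """Extend a mass function on the i-th binary frame to the product space X^n."""
    return {
        frozenset(t for t in product(X, repeat=n) if t[i] in A_i): v
        for A_i, v in m_i.items()
    }

def conjunctive(m1, m2):
    """Conjunctive rule: mass products are assigned to intersections."""
    out = {}
    for (B, v1), (C, v2) in product(m1.items(), m2.items()):
        out[B & C] = out.get(B & C, 0.0) + v1 * v2
    return out

# Three hypothetical binary mass functions (not equally distributed).
masses = [
    {T: 0.5, F: 0.3, TF: 0.2},
    {T: 0.1, F: 0.6, TF: 0.3},
    {T: 0.4, F: 0.4, TF: 0.2},
]
n = len(masses)

joint = reduce(conjunctive, (vacuous_extension(m, i, n) for i, m in enumerate(masses)))

# Compare each focal element A_1 x ... x A_n with the product of the masses m_i(A_i).
lemma_holds = all(
    abs(joint.get(frozenset(product(*parts)), 0.0)
        - prod(m[a] for m, a in zip(masses, parts))) < 1e-12
    for parts in product((T, F, TF), repeat=n)
)

print("number of focal elements:", len(joint))
print("empty set among focal elements:", frozenset() in joint)
print("product formula holds:", lemma_holds)
```

Since the empty set never appears among the intersections, normalising or not makes no difference, in agreement with the remark above.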

5.2 Factorisation for ‘sharp’ tuples

The following becomes then a simple corollary.

Theorem 1.

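When using either $\textcircled{$\scriptstyle\cap$}$ or $\oplus$ as a combination rule in the definition of belief likelihood function, the following decomposition holds for tuples $(x_1,\ldots,x_n)$, $x_i \in X_i$, which are the singleton elements of $X_1 \times \cdots \times X_n$, with $X_1 = \cdots = X_n = \{T, F\}$:

 $$Bel_{X_1 \times \cdots \times X_n}(\{(x_1,\ldots,x_n)\}|\theta) = \prod_{i=1}^n Bel_{X_i}(\{x_i\}|\theta).$$ (11)
Proof.

For the singleton elements of $X_1 \times \cdots \times X_n$, since $\{(x_1,\ldots,x_n)\} = \{x_1\} \times \cdots \times \{x_n\}$, Equation (11) becomes: $Bel_{X_1 \times \cdots \times X_n}(\{(x_1,\ldots,x_n)\}) = m_{Bel_{X_1} \oplus \cdots \oplus Bel_{X_n}}(\{(x_1,\ldots,x_n)\}) = \prod_{i=1}^n m_{X_i}(\{x_i\}) = \prod_{i=1}^n Bel_{X_i}(\{x_i\})$, where the mass factorisation follows from Lemma 1, as on singletons mass and belief values coincide. ∎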

There is evidence to support the following as well.

Conjecture 1.

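When using either $\textcircled{$\scriptstyle\cap$}$ or $\oplus$ as a combination rule in the definition of belief likelihood function, the following decomposition holds for the associated plausibility values on tuples $(x_1,\ldots,x_n)$, $x_i \in X_i$, which are the singleton elements of $X_1 \times \cdots \times X_n$, with $X_1 = \cdots = X_n = \{T, F\}$:

 $$Pl_{X_1 \times \cdots \times X_n}(\{(x_1,\ldots,x_n)\}|\theta) = \prod_{i=1}^n Pl_{X_i}(\{x_i\}|\theta).$$ (12)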

Indeed we can write:

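 $$Pl_{X_1 \times \cdots \times X_n}(\{(x_1,\ldots,x_n)\}) = 1 - Bel_{X_1 \times \cdots \times X_n}(\{(x_1,\ldots,x_n)\}^c) = 1 - \sum_{B \subseteq \{(x_1,\ldots,x_n)\}^c} m_{Bel_{X_1} \oplus \cdots \oplus Bel_{X_n}}(B).$$ (13)

By Lemma 1 all the subsets $B$ with non-zero mass are Cartesian products of the form $A_1 \times \cdots \times A_n$, $\emptyset \neq A_i \subseteq X_i$. We then need to understand the nature of the focal elements of $Bel_{X_1 \times \cdots \times X_n}$ which are subsets of an arbitrary singleton complement $\{(x_1,\ldots,x_n)\}^c$.

For binary spaces $X_i = X = \{T, F\}$, by definition of Cartesian product, each such $A_1 \times \cdots \times A_n \subseteq \{(x_1,\ldots,x_n)\}^c$ is obtained by replacing a number $1 \leq k \leq n$ of components of the tuple $(x_1,\ldots,x_n) = \{x_1\} \times \cdots \times \{x_n\}$ with a different subset of $X_i$ (either $\{x_i\}^c = X_i \setminus \{x_i\}$ or $X_i$). There are $\binom{n}{k}$ such sets of $k$ components in a list of $n$. Of these $k$ components, in general $1 \leq m \leq k$ will be replaced by $\{x_i\}^c$, while the other $k - m$ will be replaced by $X_i$. Note that not all $k$ components can be replaced by $X_i$, since the resulting focal element would contain the tuple $(x_1,\ldots,x_n)$.

The conjecture can be proved for $(x_1,\ldots,x_n) = (T,\ldots,T)$, under the additional assumption that $Bel_{X_1},\ldots,Bel_{X_n}$ are equally distributed with $p \doteq m_{X_i}(\{T\})$, $q \doteq m_{X_i}(\{F\})$, and $r \doteq m_{X_i}(X_i)$.
If this is the case, for fixed values of $m$ and $k$ all the resulting focal elements have the same mass value, namely $p^{n-k} q^m r^{k-m}$. As there are exactly $\binom{n}{k}\binom{k}{m}$ such focal elements, (13) can be written as: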

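 $$1 - \sum_{k=1}^n \binom{n}{k} \sum_{m=1}^k \binom{k}{m} p^{n-k} q^m r^{k-m},$$

which can be rewritten as:

 $$1 - \sum_{m=1}^n q^m \sum_{k=m}^n \binom{n}{k} \binom{k}{m} p^{n-k} r^{k-m}.$$

A change of variable $l = n - k$, where $l = 0$ when $k = n$, $l = n - m$ when $k = m$, allows us to write it as:

 $$1 - \sum_{m=1}^n q^m \sum_{l=0}^{n-m} \binom{n}{n-l} \binom{n-l}{m} p^l r^{(n-m)-l},$$

since $k - m = n - l - m$, $k = n - l$. Now, as

 $$\binom{n}{n-l} \binom{n-l}{m} = \binom{n}{m} \binom{n-m}{l}$$

we obtain: $1 - \sum_{m=1}^n q^m \binom{n}{m} \sum_{l=0}^{n-m} \binom{n-m}{l} p^l r^{(n-m)-l}$. By Newton’s binomial, the latter is equal to $1 - \sum_{m=1}^n \binom{n}{m} q^m (p+r)^{n-m} = 1 - \sum_{m=1}^n \binom{n}{m} q^m (1-q)^{n-m}$, since $p + r = 1 - q$. Again, by Newton’s binomial, we get:

 $$Pl_{X_1 \times \cdots \times X_n}(\{(T,\ldots,T)\}) = 1 - [1 - (1-q)^n] = (1-q)^n = \prod_{i=1}^n Pl_{X_i}(\{T\}).$$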
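
The chain of identities above can be verified numerically. The following Python sketch evaluates the double sum obtained from (13), and compares it both with a brute-force enumeration of the focal elements given by Lemma 1 and with the closed form $(1-q)^n$. The values chosen for $p$, $q$, $r$ and $n$ are arbitrary, and the equally distributed setting is the one assumed in the argument.

```python
from itertools import product
from math import comb, prod

def pl_double_sum(p, q, r, n):
    """1 - sum_k C(n,k) sum_m C(k,m) p^(n-k) q^m r^(k-m), as derived from (13)."""
    return 1.0 - sum(
        comb(n, k) * sum(comb(k, m) * p ** (n - k) * q ** m * r ** (k - m)
                         for m in range(1, k + 1))
        for k in range(1, n + 1)
    )

def pl_brute_force(p, q, r, n):
    """Sum the masses of all products A_1 x ... x A_n that contain (T, ..., T)."""
    mass = {"T": p, "F": q, "X": r}  # "X" stands for the whole binary frame {T, F}
    return sum(
        prod(mass[a] for a in parts)
        for parts in product("TFX", repeat=n)
        if all(a != "F" for a in parts)
    )

# Hypothetical masses of {T}, {F} and {T, F} for each trial, and number of trials.
p, q, r, n = 0.5, 0.3, 0.2, 4

print("double sum :", pl_double_sum(p, q, r, n))
print("brute force:", pl_brute_force(p, q, r, n))
print("(1 - q)^n  :", (1.0 - q) ** n)
```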

5.3 Factorisation for Cartesian products

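Decomposition (11) is equivalent to what Smets calls conditional conjunctive independence [30]. In fact, for binary spaces factorisation (11) generalises to all subsets $A \subseteq X_1 \times \cdots \times X_n$ of samples which are Cartesian products of subsets of $X_1,\ldots,X_n$, respectively: $A = A_1 \times \cdots \times A_n$, $A_i \subseteq X_i$ for all $i$.

Corollary 1.

Whenever $A_i \subseteq X_i = \{T, F\}$, $i = 1,\ldots,n$, under conjunctive combination we have that:

 $$Bel_{X_1 \times \cdots \times X_n}(A_1 \times \cdots \times A_n|\theta) = \prod_{i=1}^n Bel_{X_i}(A_i|\theta).$$ (14)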
Proof.

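As by Lemma 1 all the focal elements of $Bel_{X_1} \textcircled{$\scriptstyle\cap$} \cdots \textcircled{$\scriptstyle\cap$} Bel_{X_n}$ are Cartesian products of the form $B = B_1 \times \cdots \times B_n$, $B_i \subseteq X_i$, it follows that $Bel_{X_1 \times \cdots \times X_n}(A_1 \times \cdots \times A_n)$ is equal to:

 $$\sum_{B \subseteq A_1 \times \cdots \times A_n,\ B = B_1 \times \cdots \times B_n} m_{X_1}(B_1) \cdot \ldots \cdot m_{X_n}(B_n).$$

But $\{B = B_1 \times \cdots \times B_n \subseteq A_1 \times \cdots \times A_n\} = \{B = B_1 \times \cdots \times B_n,\ B_i \subseteq A_i\ \forall i\}$, since if $B_j \not\subseteq A_j$ for some $j$ the resulting Cartesian product would not be a subset of $A_1 \times \cdots \times A_n$. Thus,

 $$Bel_{X_1 \times \cdots \times X_n}(A_1 \times \cdots \times A_n) = \sum_{B = B_1 \times \cdots \times B_n,\ B_i \subseteq A_i\ \forall i} m_{X_1}(B_1) \cdot \ldots \cdot m_{X_n}(B_n).$$ (15)

For all those $A_i$’s, $i = i_1,\ldots,i_m$, that are singletons of $X_i$, necessarily $B_i = A_i$ and we can write (15) as:

 $$m(A_{i_1}) \cdot \ldots \cdot m(A_{i_m}) \sum_{B_j \subseteq A_j,\ j \neq i_1,\ldots,i_m}\ \prod_{j \neq i_1,\ldots,i_m} m_{X_j}(B_j).$$

If the frames are binary, $X_i = \{T, F\}$, those $A_i$’s that are not singletons coincide with $X_i$, so that we have:

 $$m(A_{i_1}) \cdot \ldots \cdot m(A_{i_m}) \sum_{B_j \subseteq X_j,\ j \neq i_1,\ldots,i_m}\ \prod_{j \neq i_1,\ldots,i_m} m_{X_j}(B_j).$$

The quantity $\sum_{B_j \subseteq X_j,\ j \neq i_1,\ldots,i_m} \prod_{j \neq i_1,\ldots,i_m} m_{X_j}(B_j)$ is, according to the definition of conjunctive combination, the sum of the masses of all the possible intersections of (cylindrical extensions of) focal elements of $Bel_{X_j}$, $j \neq i_1,\ldots,i_m$, which thus add up to 1. In conclusion (forgetting the conditioning on $\theta$ in the derivation for sake of readability): $Bel_{X_1 \times \cdots \times X_n}(A_1 \times \cdots \times A_n) = m(A_{i_1}) \cdot \ldots \cdot m(A_{i_m}) \cdot 1 = Bel_{X_{i_1}}(A_{i_1}) \cdot \ldots \cdot Bel_{X_{i_m}}(A_{i_m}) \cdot \prod_{j \neq i_1,\ldots,i_m} Bel_{X_j}(X_j)$, and we have (14). ∎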

Corollary 1 states that conditional conjunctive independence always holds for events that are Cartesian products, whenever the involved frames are binary.
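
A quick numerical check of (14) is sketched below in Python. It relies on the product structure of the focal elements given by Lemma 1 (rather than recomputing the combination), and tests every Cartesian product of non-empty subsets for three binary trials. The mass values are hypothetical.

```python
from itertools import product
from math import prod

T, F, TF = frozenset("T"), frozenset("F"), frozenset("TF")
FOCAL = (T, F, TF)

def bel_single(m, A):
    """Belief of A under a mass function on a single binary frame."""
    return sum(v for B, v in m.items() if B <= A)

def bel_joint(masses, parts):
    """Belief of A_1 x ... x A_n, using the product focal elements of Lemma 1."""
    target = frozenset(product(*parts))
    total = 0.0
    for focal in product(FOCAL, repeat=len(masses)):
        if frozenset(product(*focal)) <= target:
            total += prod(m[b] for m, b in zip(masses, focal))
    return total

# Three hypothetical binary mass functions (not equally distributed).
masses = [
    {T: 0.5, F: 0.3, TF: 0.2},
    {T: 0.1, F: 0.6, TF: 0.3},
    {T: 0.4, F: 0.4, TF: 0.2},
]

# Largest deviation between the two sides of (14) over all 27 products.
max_gap = max(
    abs(bel_joint(masses, parts)
        - prod(bel_single(m, a) for m, a in zip(masses, parts)))
    for parts in product(FOCAL, repeat=len(masses))
)
print("largest deviation from (14):", max_gap)
```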

6 Binary trials: the disjunctive case

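Similar factorisation results hold when using the (more cautious) disjunctive combination $\textcircled{$\scriptstyle\cup$}$.

6.1 Structure of the focal elements

As in the conjunctive case, we first analyse the case $n = 2$. We seek the disjunctive combination $Bel_{X_1} \textcircled{$\scriptstyle\cup$} Bel_{X_2}$, where each $Bel_{X_i}$ has as focal elements $\{T\}$, $\{F\}$ and $X_i$. Figure 2 is a diagram of all the unions of focal elements of the two input BFs on their common refinement $X_1 \times X_2$.

There are $5 = 2^2 + 1$ distinct such unions, the focal elements of $Bel_{X_1} \textcircled{$\scriptstyle\cup$} Bel_{X_2}$, with masses:

 $$m(\{(x_i,x_j)\}^c) = m_{X_1}(\{x_i\}^c) \cdot m_{X_2}(\{x_j\}^c), \qquad m(X_1 \times X_2) = 1 - \sum_{i,j} m_{X_1}(\{x_i\}^c) \cdot m_{X_2}(\{x_j\}^c).$$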

We can now prove the following Lemma.

Lemma 2.

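The belief function $Bel_{X_1} \textcircled{$\scriptstyle\cup$} \cdots \textcircled{$\scriptstyle\cup$} Bel_{X_n}$, where $X_i = X = \{T, F\}$, has $2^n + 1$ focal elements, namely all the complements of the $n$-tuples $\vec{x} = (x_1,\ldots,x_n)$ of singleton elements $x_i \in X_i$, with BPA:

 $$m_{Bel_{X_1} \textcircled{$\scriptstyle\cup$} \cdots \textcircled{$\scriptstyle\cup$} Bel_{X_n}}(\{(x_1,\ldots,x_n)\}^c) = m_{X_1}(\{x_1\}^c) \cdot \ldots \cdot m_{X_n}(\{x_n\}^c),$$ (16)

plus the Cartesian product $X_1 \times \cdots \times X_n$ itself, with mass value given by normalisation.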

Proof.

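The proof is by induction. The case $n = 2$ was proven above. In the induction step, we assume that the thesis is true for $n$, namely that the focal elements of $Bel_{X_1} \textcircled{$\scriptstyle\cup$} \cdots \textcircled{$\scriptstyle\cup$} Bel_{X_n}$ have the form:

 $$A = \{(x_1,\ldots,x_n)\}^c = \{(x'_1,\ldots,x'_n) \mid \exists i: \{x'_i\} = \{x_i\}^c\},$$ (17)

where $x_i \in X_i = X$. We need to prove it true for $n+1$.
The vacuous extension of (17) has trivially the form:

 $$A' = \{(x'_1,\ldots,x'_n,x_{n+1}) \mid \exists i: \{x'_i\} = \{x_i\}^c,\ x_{n+1} \in X\}.$$

Note that only $2 = |X|$ singletons of $X_1 \times \cdots \times X_{n+1}$ are not in $A'$, for any given tuple $(x_1,\ldots,x_n)$.
The vacuous extension to $X_1 \times \cdots \times X_{n+1}$ of a focal element $B = \{x_{n+1}\}$ of $Bel_{X_{n+1}}$ is instead:

 $$B' = \{(y_1,\ldots,y_n,x_{n+1}) \mid y_i \in X\ \forall i = 1,\ldots,n\}.$$

Now, all the elements of $B'$, except for $(x_1,\ldots,x_n,x_{n+1})$, are also elements of $A'$. Hence, the union $A' \cup B'$ reduces to the union of $A'$ and $\{(x_1,\ldots,x_n,x_{n+1})\}$. The only singleton element of $X_1 \times \cdots \times X_{n+1}$ not in $A' \cup B'$ is therefore $(x_1,\ldots,x_n,x'_{n+1})$, $\{x'_{n+1}\} = \{x_{n+1}\}^c$, for it is neither in $A'$ nor in $B'$. All such unions are distinct. Thus, by definition of $\textcircled{$\scriptstyle\cup$}$, their mass is $m(\{(x_1,\ldots,x_n)\}^c) \cdot m_{X_{n+1}}(\{x_{n+1}\})$, with $\{x_{n+1}\} = \{x'_{n+1}\}^c$, which by inductive hypothesis is equal to (16). Unions involving either $X_{n+1}$ or $X_1 \times \cdots \times X_n$ are equal to $X_1 \times \cdots \times X_{n+1}$ by the property of the union operator. ∎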

6.2 Factorisation

Theorem 2.

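In the hypotheses of Lemma 2, when using disjunctive combination $\textcircled{$\scriptstyle\cup$}$ in the definition of belief likelihood function, the following decomposition holds:

 $$Bel_{X_1 \times \cdots \times X_n}(\{(x_1,\ldots,x_n)\}^c|\theta) = \prod_{i=1}^n Bel_{X_i}(\{x_i\}^c|\theta).$$ (18)
Proof.

As $\{(x_1,\ldots,x_n)\}^c$ contains only itself as a focal element:

 $$Bel_{X_1 \times \cdots \times X_n}(\{(x_1,\ldots,x_n)\}^c|\theta) = m(\{(x_1,\ldots,x_n)\}^c|\theta).$$

By Lemma 2 the latter becomes

 $$Bel_{X_1 \times \cdots \times X_n}(\{(x_1,\ldots,x_n)\}^c|\theta) = \prod_{i=1}^n m_{X_i}(\{x_i\}^c|\theta) = \prod_{i=1}^n Bel_{X_i}(\{x_i\}^c|\theta),$$

as $\{x_i\}^c$ is a singleton element of $X_i$, and we have (18). ∎

Note that $Pl_{X_1 \times \cdots \times X_n}(\{(x_1,\ldots,x_n)\}^c|\theta) = 1$ for all tuples $(x_1,\ldots,x_n)$, as the set $\{(x_1,\ldots,x_n)\}^c$ has non-empty intersection with all the focal elements of $Bel_{X_1} \textcircled{$\scriptstyle\cup$} \cdots \textcircled{$\scriptstyle\cup$} Bel_{X_n}$.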
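
Lemma 2, Theorem 2 and the remark on plausibility can be checked by brute force for a small number of trials. The Python sketch below assumes subsets of the product space are encoded as `frozenset`s of tuples, and uses three hypothetical binary mass functions; it is meant as a sanity check rather than as an efficient implementation.

```python
from functools import reduce
from itertools import product
from math import prod

X = ("T", "F")
T, F, TF = frozenset("T"), frozenset("F"), frozenset(X)
FLIP = {"T": "F", "F": "T"}  # complement of a singleton in a binary frame

def vacuous_extension(m_i, i, n):
    """Extend a mass function on the i-th binary frame to the product space X^n."""
    return {
        frozenset(t for t in product(X, repeat=n) if t[i] in A_i): v
        for A_i, v in m_i.items()
    }

def disjunctive(m1, m2):
    """Disjunctive rule (6): mass products are assigned to unions."""
    out = {}
    for (B, v1), (C, v2) in product(m1.items(), m2.items()):
        out[B | C] = out.get(B | C, 0.0) + v1 * v2
    return out

def bel(m, A):
    return sum(v for B, v in m.items() if B <= A)

def pl(m, A):
    return sum(v for B, v in m.items() if B & A)

# Three hypothetical binary mass functions.
masses = [
    {T: 0.5, F: 0.3, TF: 0.2},
    {T: 0.1, F: 0.6, TF: 0.3},
    {T: 0.4, F: 0.4, TF: 0.2},
]
n = len(masses)
space = frozenset(product(X, repeat=n))

joint = reduce(disjunctive, (vacuous_extension(m, i, n) for i, m in enumerate(masses)))

checks = []
for x in space:
    complement = space - {x}
    # Right-hand side of (16) and (18): product of the masses of {x_i}^c.
    expected = prod(m[frozenset(FLIP[x_i])] for m, x_i in zip(masses, x))
    checks.append(abs(joint[complement] - expected) < 1e-12)       # Lemma 2
    checks.append(abs(bel(joint, complement) - expected) < 1e-12)  # Theorem 2
    checks.append(abs(pl(joint, complement) - 1.0) < 1e-12)        # Pl equals 1

print("number of focal elements:", len(joint))
print("all checks passed:", all(checks))
```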

7 General factorisation results

The argument of Lemma 1 is in fact valid for the conjunctive combination of belief functions defined on an arbitrary collection of finite spaces.

Theorem 3.

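For any $n \in \mathbb{Z}$ the belief function $Bel_{X_1} \oplus \cdots \oplus Bel_{X_n}$, where $X_1,\ldots,X_n$ are finite spaces, has as focal elements all the Cartesian products $A_1 \times \cdots \times A_n$ of $n$ focal elements $A_1 \subseteq X_1,\ldots,A_n \subseteq X_n$, with BPA:

 $$m_{Bel_{X_1} \oplus \cdots \oplus Bel_{X_n}}(A_1 \times \cdots \times A_n) = \prod_{i=1}^n m_{X_i}(A_i).$$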

The proof is similar to that of Lemma 1, and is omitted for lack of space. It follows that:

Corollary 2.

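When using either $\textcircled{$\scriptstyle\cap$}$ or $\oplus$ as a combination rule in the definition of belief likelihood function, the following decomposition holds for tuples $(x_1,\ldots,x_n)$, $x_i \in X_i$, which are the singleton elements of $X_1 \times \cdots \times X_n$, with $X_1,\ldots,X_n$ any finite frames of discernment:

 $$Bel_{X_1 \times \cdots \times X_n}(\{(x_1,\ldots,x_n)\}|\theta) = \prod_{i=1}^n Bel_{X_i}(\{x_i\}|\theta).$$ (19)
Proof.

For the singleton elements of $X_1 \times \cdots \times X_n$, since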