On the efficient estimation of the mean of multivariate truncated normal distributionsThis research has been co-financed by the European Union (European Social Fund - ESF) and Greek national funds through the Operational Program “Education and Lifelong Learning” of the National Strategic Reference Framework (NSRF) - Research Funding Program: ARISTEIA- HSI-MARS-1413.

# On the efficient estimation of the mean of multivariate truncated normal distributions††thanks: This research has been co-financed by the European Union (European Social Fund - ESF) and Greek national funds through the Operational Program “Education and Lifelong Learning” of the National Strategic Reference Framework (NSRF) - Research Funding Program: ARISTEIA- HSI-MARS-1413.

Konstantinos D. Koutroumbas Institute for Astronomy, Astrophysics, Space Applications and Remote sensing, National Observatory of Athens, I. Metaxa and Vas. Paulou, Penteli, GR-152 36, Greece, Tel. No: +30-210-8109189, Fax No: +30-210-6138343, E-mail: koutroum@noa.gr.    Konstantinos E. Themelis    Athanasios A. Rontogiannis
23 January 2014
###### Abstract

A non trivial problem that arises in several applications is the estimation of the mean of a truncated normal distribution. In this paper, an iterative deterministic scheme for approximating this mean is proposed, motivated by an iterative Markov chain Monte Carlo (MCMC) scheme that addresses the same problem. Conditions are provided under which it is proved that the scheme converges to a unique fixed point. The quality of the approximation obtained by the proposed scheme is assessed through the case where the exponential correlation matrix is used as covariance matrix of the initial (non truncated) normal distribution. Finally, the theoretical results are also supported by computer simulations, which show the rapid convergence of the method to a solution vector that (under certain conditions) is very close to the mean of the truncated normal distribution under study.

Keywords: Truncated normal distribution, contraction mapping, diagonally dominant matrix, MCMC methods, exponential correlation matrix

## I Introduction

A non trivial problem that appears in several applications, e.g. in multivariate regression and Bayesian statistics, is the estimation of the mean of a truncated normal distribution. The problem arises in cases where a random vector follows a normal distribution with mean and covariance matrix , denoted by , but is restricted to a closed subset of . However, the restrictions considered in most of the cases are of the form , with , . If and ( and ), we have a one-sided truncation from the left (right), while in the case where both and are reals, we have two-sided truncation.

The problem has attracted the attention of several researchers from the sixties. Since then, several deterministic approaches have been proposed with some of them trying to estimate not only the mean but also other moments (such as the variance) of a multivariate truncated normal distribution. These approaches can be categorized according to whether they are suitable for one-sided truncated normal distributions ([19], [22], [2], [7], [9]) or for two-sided truncated normal distributions ([21], [17], [6], [8], [13], [10]), or according to whether they consider the bivariate, , ([19], [21], [17], [13]) or the multivariate case ([22], [2], [6], [7], [8], [9], [10]). In addition, some of these methods put additional restrictions to the distribution paramaters (e.g. [22], [2], [7] require that ). Most of these methods either perform direct integration (e.g. [22]) or they utilize the moment generating function tool (e.g. [10]).

An alternative strategy to deal with this problem has been followed in [18]. More specifically, in [18], a Markov Chain Monte Carlo (MCMC) iterative scheme has been developed. According to this, at each iteration, succesive samplings take place, one from each one-dimensional conditionals of the truncated normal distribution. The mean of the -th such distribution is the mean of conditioned on the . After performing several iterations, the estimation of the mean of the truncated normal results by performing an averaging over the produced samples. Convergence issues of this scheme to the mean of the truncated normal distribution are a subject of the Markov chain theory. A relative work that accelerates the method in [18] is exhibited in [5].

The work presented in this paper for approximating the mean of a multivariate truncated normal distribution has been inspired from that of [18]. Specifically, instead of selecting a sample from each one of the above one-dimensional distributions, we select its mean. Thus, the proposed scheme departs from the statistical framework adopted in [18] and moves to the deterministic one.

This work is an extension of a relative scheme used in [23] in the framework of spectral unmixing in hyperspectral images. In addition, a convergence proof of the proposed scheme is given when certain conditions are fulfilled. The quality of the approximation of the mean offered by the proposed method is assessed via the case where is the exponential correlation matrix. Experimental results show that the new scheme converges significantly faster than the MCMC approach of [18].

The paper is organized as follows. Section 2 contains some necessary definitions and a brief description of the work in [18]. In Section 3, the newly proposed method is described and in Section 4 conditions are given under which it is proved to converge. In Section 5, the proposed method is applied to the case where is the exponential correlation matrix. In Section 6 simulation results are provided and a relevant discussion is presented. Finally, Section 7 concludes the paper.

## Ii Preliminaries and previous work

Let us consider the -dimensional normal distribution

 N(\boldmathx|\boldmathμ,Σ)=1(2π)n/2|Σ|1/2exp(−12(\boldmathx−\boldmathμ)TΣ−1(\boldmathx−\boldmathμ)) (1)

where the -dimensional vector is its mean and the matrix is its covariance matrix.

Let be a subset of with positive Lebesgue measure. We denote by the truncated normal distribution which results from the truncation of in . Speaking in mathematical terms

 NRn(\boldmathx|\boldmathμ,Σ)=⎧⎪ ⎪⎨⎪ ⎪⎩exp(−12(\boldmathx−% \boldmathμ)TΣ−1(\boldmathx−\boldmathμ))∫Rnexp(−12(\boldmathx−\boldmathμ)TΣ−1(\boldmathx−\boldmathμ))d% \boldmathw,if \boldmathx∈Rn0,otherwise (2)

Note that is proportional to , where , if and , otherwise.

In the scheme discussed in [18], a Markov Chain Monte Carlo (MCMC) method is proposed, to compute the mean of single or doubly truncated (per coordinate) normal distribution , where . The method relies on the sampling of the one-dimensional conditionals of the truncated normal distribution. More specifically, letting and denoting the expectation and the variance of conditioned on the rest coordinates, respectively111That is, follows the (non-truncated) -th conditional of ., and denoting the (one dimensional) truncated normal distribution which results from the truncation of a normal distribution with mean and variance in , the iterative sampling scheme proposed in [18] can be written as

 1.x(t)1∼N[a1,b1](E[x1|x(t−1)2,…,x(t−1)n],σ∗21)2.x(t)2∼N[a2,b2](E[x2|x(t)1,x(t−1)3…,x(t−1)n],σ∗22)⋮n.x(t)n∼N[an,bn](E[xn|x(t)1,x(t)2…,x(t)n−1],σ∗2n) (3)

where denotes the sampling action and denotes the current iteration. After performing several, say , iterations (and after discarding the first few, say , ones) the mean of each coordinate is estimated as

 E[xi]=1k−k′k∑t=k′x(t)i,  i=1,…,n

The quantities and of each one of the above one dimensional conditionals are expressed in terms of the parameters and of the non-truncated multidimensional normal distribution as follows

 μ∗i=μi+\boldmathσT¬iΣ−1¬i¬i(\boldmathx¬i−\boldmathμ¬i),   i=1,…,n (4)
 σ∗i=σii−\boldmathσT¬iΣ−1¬i¬i\boldmathσ¬i,   i=1,…,n (5)

with being the matrix that results from after removing its -th column and its -th row, being the -th column of excluding its -th element and , being the (-dimensional) vectors that result from and , respectively, after removing their -th coordinates, and , respectively. Note that depends on all ’s except .

## Iii The proposed model

In the sequel, we focus on the case where is a set of the form , where for each interval it is either, (i) and or (ii) and . This means that, along each dimension, the truncation is one-sided and more specifically, case (i) corresponds to right truncation while case (ii) corresponds to left truncation.

The proposed model for estimating the mean of , is of iterative nature and, at each iteration, it requires the computation of the (one-dimensional) function. This model has a close conceptual affinity with the one (briefly) presented in the previous section ([18]). More specifically, instead of utilizing the samples produced by the (one dimensional) distributions , we use the corresponding mean values (here denoted by ).

As it is well known (see e.g. [16]), the mean of a truncated one dimensional normal distribution , which has resulted from the (non-truncated) normal distribution with mean and variance , is expressed as

 w=μ∗+ϕ(a−μ∗σ∗)−ϕ(b−μ∗σ∗)Φ(b−μ∗σ∗)−Φ(a−μ∗σ∗)σ∗ (6)

where , and .

However, since in the present paper we consider the cases where either (i) and or (ii) and , let us see now how eq. (6) becomes for each of these cases.

(i) , . In this case it is and as a consequence, , . Thus, taking also into account the definitions of and and the fact that , eq. (6) gives,

 w=μ∗+√2πe−(μ∗−a)22σ∗2erfc(−μ∗−a√2σ∗)σ∗ (7)

(ii) , . In this case it is and, as a consequence, , . Working as in case (i), eq. (6) gives

 w=μ∗−√2πe−(b−μ∗)22σ∗2erfc(−b−μ∗√2σ∗)σ∗ (8)

Eqs. (7) and (8) can be expressed compactly via the following single equation

 w=μ∗+√2π[f(μ∗−a√2σ∗)Ia∈R−f(b−μ∗√2σ∗)Ib∈R]σ∗ (9)

where () is an indicator function which equals to if () and otherwise and

 f(x)=e−x2erfc(−x) (10)

Let us now return to the multidimensional case. Since from the (truncated) conditional one-dimensional normals we no longer perform sampling but, instead, we consider their means, eq. (4) is altered to

 μ∗i=μi+\boldmathσT¬iΣ−1¬i¬i(\boldmathw¬i−\boldmathμ¬i),   i=1,…,n (11)

where results from the current estimate of the ( - dimensional) mean vector of the truncated normal distribution, after removing its -th coordinate.

Putting now all the previous ingredients together (that is, utilizing eqs. (9), (11) and (5)) we obtain the following iterative scheme

 (12)

with being computed via eq. (11), where (the only parameter in (11) that varies through iterations) is defined as , that is, the most recent information about ’s is utilized. More formally, we can say that the above scheme performs sequential updating and (following the terminology used in [3]) it is a Gauss Seidel updating scheme.

It is reminded that, due to the type of truncation considered here (only left truncation or only right truncation per coordinate), the bracketed expression in each equation of (12) contains only one non identically equal to zero term. In the sequel, we consider separately the cases corresponding to and , i.e.,

 w(t)i=μ∗(t)i+√2πf(A(t)i)σ∗i (13)

and

 w(t)i=μ∗(t)i−√2πf(B(t)i)σ∗i (14)

where

 A(t)i=μ∗(t)i−ai√2σ∗i (15)

and

 B(t)i=bi−μ∗(t)i√2σ∗i (16)

, respectively, and is defined as in eq. (10).

## Iv Convergence issues

In this section we provide sufficient conditions under which the proposed scheme is proved to converge222However, it is noted that even if these conditions are slightly violated, the algorithm still works, as is verified by the experiments presented in Section 5.. Before we proceed, we give some propositions and remind some concepts that will be proved useful in the sequel.

Proposition 1: Assume that is a symmetric positive definite matrix, is the -th column of , after removing its -th element and results from after removing its -th row and its -th column. Also, let be the element of and be the -dimensional vector resulting from the -th row of after (i) removing its -th element, , and (ii) multiplying the remaining elements by . Then, it holds

(i) and

(ii) .

The proof of this proposition is straightforward from the inversion lemma for block partitioned matrices ([20], p. 53) and the use of permutation matrices, in order to define the Schur complement for each row of .

Proposition 2: It is

 −√π≤f′(x)≤0,∀x∈R (17)

where denotes the derivative of , which is defined in eq. (10).

The proof of proposition 2 is given in the appendix. Definition 1: A mapping , where , is called contraction if for some norm there exists some constant (called modulus) such that

 ||T(\boldmathx)−T(\boldmathy)||≤α||% \boldmathx−\boldmathy||,  ∀\boldmathx,%\boldmath$y$∈X (18)

The corresponding iteration is called contracting iteration.

Proposition 3 ([3], pp. 182-183): Suppose that is a contraction with modulus and that is a closed subset of . Then

(a) The mapping has a unique fixed point 333A point is called fixed point of a mapping if it is ..

(b) For every initial vector , the sequence , generated by converges to geometrically. In particular,

 ||\boldmathx(t)−\boldmathx∗||≤αt||% \boldmathx(0)−\boldmathx∗||,  ∀t≥0

Let us define the mappings , as

 Ti(\boldmathx) =μ∗i(\boldmathx)+ +√2π[f(μ∗i(\boldmathx)−ai√2σ∗i)Iai∈R−f(bi−μ∗i(\boldmathx)√2σ∗i)Ibi∈R]σ∗i (19)

where

 μ∗i(\boldmathx)=μi+\boldmathσT¬iΣ−1¬i¬i(\boldmathx¬i−\boldmath% μ¬i) (20)

and

 Ai(\boldmathx)=μ∗i(\boldmathx)−ai√2σ∗i,Bi(\boldmathx)=bi−μ∗i(\boldmathx)√2σ∗i (21)

and the mapping as

 T(\boldmathx)=(T1(\boldmathx),…,Tn(% \boldmathx))

Let us define next the mapping as

 ^Ti(\boldmathx)=^Ti(x1,…,xn)=(x1,…,xi−1,Ti(\boldmathx),xi+1,…,xn)

Performing the sequential updating as described by eq. (12) (one at a time and in increasing order) is equivalent to applying the mapping , defined as

 S=^Tn∘^Tn−1∘…∘^T2∘^T1 (22)

where denotes function composition. Following the terminology given in [3], is called the Gauss Seidel mapping based on the mapping and the iteration is called the Gauss Seidel algorithm based on mapping .

A direct consequence of [3, Prop. 1.4, pp.186] is the following Proposition: Proposition 4: If is a contraction with respect to the norm, then the Gauss-Seidel mapping is also a contraction (with respect to the norm), with the same modulus as . In particular, if is closed, the sequence of the vectors generated by the Gauss-Seidel algorithm based on the mapping converges to the unique fixed point of geometrically. Having given all the necessary definitions and results, we will proceed by proving that (a) for each mapping it holds , , where , are the and norms, respectively, (b) if is diagonally dominant then is a contraction and (c) provided that is a contraction, the algorithm converges geometrically to the unique fixed point of . We remind here that the -dimensional vector results from the -th row of , exluding its -th element and dividing each element by the negative of .

Proposition 5: For the mappings , , it holds

 |Ti(\boldmathx)−Ti(\boldmathy)|≤||% \boldmathσT¬iΣ−1¬i¬i||1||% \boldmathx−\boldmathy||∞,  ∀\boldmathx% ,\boldmathy∈Rn (23)

Proof: (a) We consider first the case where . Let us consider the vectors . Since is constant, utilizing eq. (20) it follows that

 μ∗i(\boldmathx)−μ∗i(\boldmathy)=% \boldmathσT¬iΣ−1¬i¬i(\boldmathx% ¬i−\boldmathy¬i) (24)

Also, it is

 Ai(\boldmathx)−Ai(\boldmathy)=1√2σ∗i\boldmathσT¬iΣ−1¬i¬i(\boldmathx¬i−\boldmathy¬i) (25)

Taking the difference we have

 Ti(\boldmathx)−Ti(\boldmathy) =(μ∗i(\boldmathx)−μ∗i(\boldmath% y))+ +√2π(f(Ai(\boldmathx))−f(Ai(\boldmathy)))σ∗i (26)

Since is continuous in , the mean value theorem guarantees that there exists such that

 f(Ai(\boldmathx))−f(Ai(\boldmathy))=f′(ξi)(Ai(\boldmathx)−Ai(\boldmathy)) (27)

Substituting eq. (27) to (26) we get

 Ti(\boldmathx)−Ti(\boldmathy) =(μ∗i(\boldmathx)−μ∗i(\boldmath% y))+ +√2πf′(ξi)(Ai(\boldmathx)−Ai(\boldmathy))σ∗i (28)

Substituting (24) and (25) into (28) and after some manipulations it follows that

 Ti(\boldmathx)−Ti(\boldmathy)=(1+f′(ξi)√π)\boldmathσT¬iΣ−1¬i¬i(\boldmathx¬i−\boldmathy¬i) (29)

Taking absolute values in eq. (29) and applying Hölder’s inequality , for and , it follows that

 |Ti(\boldmathx)−Ti(\boldmathy)|≤||(1+f′(ξi)√π)\boldmathσT¬iΣ−1¬i¬i||1||\boldmathx¬i−\boldmathy¬i||∞ (30)

Taking into account that (from Proposition 2), and the (trivial) fact that , it follows that

 |Ti(\boldmathx)−Ti(\boldmathy)|≤||% \boldmathσT¬iΣ−1¬i¬i||1||% \boldmathx−\boldmathy||∞ (31)

Thus, the claim has been proved.

(b) We consider now the case where . Working similarly to the previous case, the difference is

 Bi(\boldmathx)−Bi(\boldmathy)=−1√2σ∗i\boldmathσT¬iΣ−1¬i¬i(\boldmathx¬i−\boldmathy¬i) (32)

while the difference is expressed as

 Ti(\boldmathx)−Ti(\boldmathy) =(μ∗i(\boldmathx)−μ∗i(\boldmath% y))+ +√2π(f(Bi(\boldmathx))−f(Bi(\boldmathy)))σ∗i (33)

Utilizing the mean value theorem we have that there exists such that

 f(Bi(\boldmathx))−f(Bi(\boldmathy))=f′(ξi)(Bi(\boldmathx)−Bi(\boldmathy)) (34)

Substituting eq. (34) to (33) we get

 Ti(\boldmathx)−Ti(\boldmathy) =(μ∗i(\boldmathx)−μ∗i(\boldmath% y))+ +√2πf′(ξi)(Bi(\boldmathx)−Bi(\boldmathy))σ∗i (35)

Substituting eqs. (24) and (32) into (35), we obtain

 Ti(\boldmathx)−Ti(\boldmathy)=(1+f′(ξi)√π)\boldmathσT¬iΣ−1¬i¬i(\boldmathx¬i−\boldmathy¬i) (36)

From this point on, the proof is exactly the same with that of (a). Q.E.D. Proposition 6: The mapping is a contraction in , with respect to the norm, provided that is diagonally dominant. Proof: Let . Taking into account proposition 5, it easily follows that

 ||T(\boldmathx)−T(\boldmathy)||∞≡maxi=1,…,n{|Ti(\boldmathx)−Ti(\boldmathy)|}≤
 ≤maxi=1,…,n{||\boldmathσT¬iΣ−1¬i¬i||1}||\boldmathx−\boldmathy||∞

Now, (a) taking into account that the -dimensional vector results from the -th row of , exluding its -th element and dividing each element by the negative of and (b) recalling that is the -th row of excluding its -th element , it is

 ||\boldmathσT¬iΣ−1¬i¬i||1=||\boldmaths¬isii||1

Provided that is diagonally dominant, it is

 ||\boldmaths¬i||1≤|sii|,  ∀i

or

 ||\boldmaths¬isii||1<1,  ∀i

which proves the claim. Q.E.D.

Theorem 1: The algorithm converges geometrically to the unique fixed point of , provided that is diagonally dominant. Proof: The proof is a direct consequence of the propositions 3, 4 and 6 exposed before, applied for . Q.E.D.

## V Assessment of the accuracy of the proposed method

An issue that naturally arises with the proposed method is how accurate the estimate of the mean is. Since it is very difficult to give a theoretical analysis of this issue, mainly due to the highly complex nature of the propsoed iterative scheme (see eq. (12)), we will try to gain some insight for this subject via experimentation. To this end, we set equal to the exponential correlation matrix, which is frequently met in various fields of applications, e.g., in signal processing applications. Its general form is

 Σn=⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣1ρρ2…ρn−1ρ1ρ…ρn−2ρ2ρ1…ρn−3⋮⋮⋮⋱⋮ρn−1ρn−2ρn−3…1⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦,  (0≤ρ<1) (37)

It is easy to verify that the inverse of is expressed as

 Σ−1n=11−ρ2⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣1−ρ0…00−ρ1+ρ2−ρ…000−ρ1+ρ2…00⋮⋮⋮⋱⋮⋮000…1+ρ2−ρ000…−ρ1⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦ (38)

Also, it is straightforward to see that is diagonally dominant for all values of . Thus, it is a suitable candidate for our case. In addition, it is “controlled” by just a single parameter (), which facilitates the extraction of conclusions. Note that for , becomes the identity matrix, while as increases towards the diagonal dominancy of decreases (while its condition number increases). For close to , is alomost singular.

In the sequel, we consider the case of a zero mean normal distribution with covariance matrix as in (38), which is truncated in the region , that is the truncation point is the same along all dimensions. Without loss of generality, this choice has been deliberately selected, in order to keep our experimental framework controlled by just two parameters, and . Performing simulations for various combinations of the values of and (and for various dimensions ) we can gain some insight on the accuracy of the approximation of the mean provided by the proposed method. In the sequel, we use as benchmark the estimate of the mean provided by the (widely accepted as reliable) MCMC method ([18]).

Figure 1, shows a three-dimensional graph of the difference (assessed by its Euclidean norm divided by ) between the estimates of the truncated mean obtained by the MCMC and the proposed methods, against and . It is worth noting that for smaller values of (less than ), the difference remains low (less than ), independently of the value of the cutting point . On the other hand, for larger values of (above ), the difference increases. More specifically, it increases more for values of between (approximately) and .

In figure 2, the shaded regions in the space correspond to low difference (less than per dimension), for . It can be deduced that the behaviour of the proposed method is only slightly affected influenced by the dimensionality of the feature space.

As a general conclusion, one can argue that the “more diagonally dominant” the is (i.e., the smaller the is), the more accurate the estimate of the mean provided by the proposed method is. From another perspective, the more “approaches diagonality” (again, as becomes smaller), the more accurate the obtained estimates are. The latter is also supported by the fact that in the extreme case of a diagonal covariance matrix, one has to solve independent one-dimensional problems, for which an analytic formula exists. In this case, it is easy to verify that the proposed method terminates after a single iteration (see also comments in the next section).

## Vi Simulation results

After having gained some insight on the capability of the proposed method to approximate the mean of a multivariate truncated normal distribution in the previous section, we proceed in this section with experiments where now the involved covariance matrices have no specific structure. As in the previous section, the MCMC method is used as benchmark 444In the sequel, all results are rounded to the third decimal..

1st experiment: The purpose of this experiment is to compare the estimates of the mean of a truncated normal distribution obtained by the proposed and the MCMC methods, for dimensions varying from to . To this end, for each dimension , different truncated normal distributions (defined by the means and the covariance matrices of the corresponding untruncated normals, as well as their truncation points) have randomly been generated, such that, the corresponding inverse covariance matrix is diagonally dominant. For the -th such distribution, , both the proposed and the MCMC methods have been applied. Letting and denote the respective resulting estimates, the mean difference per coordinate between the two estimates is computed, i.e., and, averaging over , the quantity is obtained. In figure 3, is plotted versus . From this figure, it can be concluded that the proposed scheme gives estimates that are very close to those given by the MCMC. Thus, the fixed point of the proposed scheme (when the diagonal dominance condition is fulfilled) can be considered as a reliable estimate of the mean of the truncated normal distribution.

Next, in order to show the rapid convergence of the proposed scheme against the MCMC method, we focus on a single example (however, the resulting conclusions are generally applicable). More specifically, Table 1 shows the values of the norm of the difference divided by , i.e. the quantity , as the number of iterations evolves, for the -dimensional left truncated normal distribution defined by ,

 Σ=⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣$0.045$$−0.003$$0.013$$−0.004$$0.011$$−0.003$$0.056$$−0.015$$0.008$$0.010$$0.013$$−0.015$$0.074$$−0.001$$0.004$$−0.004$$0.008$$−0.001$$0.156$$−0.012$</