This research was partly supported by the 2014 NIMS Thematic Program on Applied Algebraic Geometry.

# Data-Discriminants of Likelihood Equations

## Abstract

Maximum likelihood estimation (MLE) is a fundamental computational problem in statistics. The problem is to maximize the likelihood function with respect to given data on a statistical model. An algebraic approach to this problem is to solve a very structured parameterized polynomial system called the likelihood equations. For general choices of data, the number of complex solutions to the likelihood equations is finite and is called the ML-degree of the model.

The only solutions to the likelihood equations that are statistically meaningful are the real/positive solutions. However, the number of real/positive solutions is not characterized by the ML-degree. We use discriminants to classify data according to the number of real/positive solutions of the likelihood equations. We call these discriminants data-discriminants (DD). We develop a probabilistic algorithm for computing DDs. Experimental results show that, for the benchmarks we have tried, the probabilistic algorithm is more efficient than the standard elimination algorithm. Based on the computational results, we discuss the real root classification problem for the symmetric matrix model.

## 1 Introduction

We begin the introduction with an illustrative example. Suppose we have a weighted four-sided die such that the probability pi of observing side i of the die satisfies the constraint p0 + 2p1 + 3p2 − 4p3 = 0. We toss the die 1000 times and record a 4-dimensional data vector (u0, u1, u2, u3), where ui is the number of times we observe side i. We want to determine the probability distribution that best explains the data subject to the constraint. One approach is maximum likelihood estimation (MLE):

Maximize the likelihood function p0^u0·p1^u1·p2^u2·p3^u3 subject to

 p0 + 2p1 + 3p2 − 4p3 = 0,  p0 + p1 + p2 + p3 = 1,
 p0 > 0, p1 > 0, p2 > 0, and p3 > 0.

For some statistical models, the MLE problem can be solved by well-known hill-climbing algorithms such as the EM-algorithm. However, hill-climbing methods can fail if there is more than one local maximum. Fortunately, it is known that the MLE problem can be solved by solving the system of likelihood equations [15, 2]:

 F0 = p0λ1 + p0λ2 − u0
 F1 = p1λ1 + 2p1λ2 − u1
 F2 = p2λ1 + 3p2λ2 − u2
 F3 = p3λ1 − 4p3λ2 − u3
 F4 = p0 + 2p1 + 3p2 − 4p3
 F5 = p0 + p1 + p2 + p3 − 1

where λ1 and λ2 are newly introduced indeterminates (Lagrange multipliers) for formulating the likelihood equations. More specifically, for given data u, if p = (p0, p1, p2, p3) is a critical point of the likelihood function, then there exist complex numbers λ1 and λ2 such that (p, λ1, λ2) is a solution of the polynomial system F0 = ⋯ = F5 = 0. For randomly chosen data u, the likelihood equations have three complex solutions. However, only solutions with positive coordinates p0, …, p3 are statistically meaningful. A solution whose coordinates p0, …, p3 are all positive is said to be a positive solution. So an important problem is real root classification (RRC):

For which data u does the polynomial system have exactly k real/positive solutions?
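Before turning to the general theory, the question can be explored numerically. For the die example, the likelihood equations can be reduced by hand: summing F0, …, F3 and using F4 = F5 = 0 forces λ1 = u0 + u1 + u2 + u3, and substituting pk = uk/(λ1 + ck·λ2), with c = (1, 2, 3, −4), into F4 = 0 clears denominators to a cubic in λ2. The plain-Python sketch below (this reduction and all function names are ours, for illustration only, not from the paper) counts real and positive solutions for sample data.

```python
from functools import reduce

C = [1, 2, 3, -4]  # weights in the model invariant p0 + 2p1 + 3p2 - 4p3

def cubic(u):
    """Ascending coefficients of f(t) = sum_i C[i]*u[i] * prod_{j!=i} (N + C[j]*t),
    the cubic obtained by clearing denominators in sum_k C[k]*p_k = 0."""
    N = sum(u)
    total = [0.0] * 4
    for i in range(4):
        poly = [float(C[i] * u[i])]
        for j in range(4):
            if j != i:  # multiply poly by (N + C[j]*t)
                nxt = [0.0] * (len(poly) + 1)
                for k, c in enumerate(poly):
                    nxt[k] += c * N
                    nxt[k + 1] += c * C[j]
                poly = nxt
        for k, c in enumerate(poly):
            total[k] += c
    return total

def all_roots(coeffs):
    """All complex roots of a polynomial (ascending coeffs), Durand-Kerner."""
    n = len(coeffs) - 1
    monic = [complex(c) / coeffs[-1] for c in coeffs]
    p = lambda x: sum(c * x ** k for k, c in enumerate(monic))
    zs = [(0.4 + 0.9j) ** k for k in range(1, n + 1)]
    for _ in range(200):
        zs = [z - p(z) / reduce(lambda acc, w: acc * (z - w),
                                (w for w in zs if w is not z), 1 + 0j)
              for z in zs]
    return zs

def classify(u):
    """Return (#real solutions, #positive solutions) of the likelihood equations."""
    N = sum(u)
    reals = [z.real for z in all_roots(cubic(u)) if abs(z.imag) < 1e-7]
    pos = sum(1 for t in reals
              if all(u[k] / (N + C[k] * t) > 0 for k in range(4)))
    return len(reals), pos

print(classify([1, 1, 1, 1]))  # -> (3, 1): three real solutions, one positive
```

For the data u = (1, 1, 1, 1), all three complex solutions turn out to be real, and exactly one of them is positive; other data can give only one real solution, which is exactly the dichotomy that RRC asks about.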

According to the theory of computational (real) algebraic geometry [26, 20], the number of real/positive solutions only changes when the data u crosses some “special” values (see Theorem 2). These “special” u form a (projective) variety (see Lemma 4 in [20]) in the data space. The number of real/positive solutions is uniform over each open connected component of the complement of this variety. In other words, the “special” u play a role similar to that of the discriminant for univariate polynomials. The first step of RRC is calculating the “special” u, leading to the discriminant problem:

How can the “special” u be computed effectively?

Geometrically, the set of “special” u is a projection of a variety, so in principle it can be computed by elimination (see Chapter 3, pages 115–128 in [6]). For instance, using the command eliminate in Macaulay2 [10], we compute that the “special” u in the illustrative example form a hypersurface defined by a homogeneous polynomial in u (see Example 1). However, for most MLE problems, due to the large size of the likelihood equations, the elimination computation is too expensive. In this paper, we discuss the “discriminant” problem for the likelihood equations. The contributions of the paper are as follows.

• For likelihood equations, we show that the “special” u form a projective variety. We call the homogeneous polynomial that generates the codimension-one component of this projective variety the data-discriminant. This name distinguishes it from a weight-discriminant of the likelihood equations, in which the weights, rather than the data, are treated as parameters.

• For algebraic statistical models, we develop a probabilistic algorithm to compute data-discriminants. We implement the algorithm in Macaulay2. Experimental results show that the probabilistic algorithm is more efficient than the standard elimination algorithm.

• We discuss the real root classification for the symmetric matrix model, which inspires future work.

We remark that our work can be viewed as following the numerous efforts in applying computational algebraic geometry to tackle MLE and critical points problems [15, 2, 1, 16, 25, 12, 8, 13, 18, 14, 21].

The paper is organized as follows. The formal definition of the data-discriminant is introduced in Section 2. The standard elimination algorithm and the probabilistic algorithm are presented in Section 3. Experimental results comparing the two algorithms are shown in Section 4. The real root classification of the symmetric matrix model and conclusion are given in Section 5.

## 2 Definition

In this section, we discuss how to define “data-discriminant”. We assume the readers are familiar with elimination theory (see Chapter 3 in [6]).

###### Notation 1.

Let Pn denote the n-dimensional complex projective space. For homogeneous polynomials g1, …, gs in Q[p0, …, pn], V(g1, …, gs) denotes the projective variety in Pn defined by g1 = ⋯ = gs = 0. Let Δn denote the n-dimensional probability simplex {(p0, …, pn) ∈ Rn+1 | p0 + ⋯ + pn = 1 and p0 > 0, …, pn > 0}.

###### Definition 1.

[15] (Algebraic Statistical Model and Model Invariant) The set X is said to be an algebraic statistical model if X = V(g1, …, gs) ∩ Δn, where g1, …, gs define an irreducible generically reduced projective variety. Each gi is said to be a model invariant of X.

For a given algebraic statistical model, there are several different ways to formulate the likelihood equations [15]. In this section, we introduce the Lagrange likelihood equations and define the data-discriminant for this formulation. One can similarly define data-discriminants for other formulations of the likelihood equations.

###### Notation 2.

For any f1, …, fk in the polynomial ring Q[x1, …, xm], Va(f1, …, fk) denotes the affine variety in Cm defined by f1 = ⋯ = fk = 0, and ⟨f1, …, fk⟩ denotes the ideal generated by f1, …, fk. For an ideal I in Q[x1, …, xm], Va(I) denotes the affine variety defined by I.

###### Definition 2.

[13] (Lagrange Likelihood Equations and Correspondence) Given an algebraic statistical model X, the system of polynomial equations below is said to be the Lagrange likelihood equations of X:

 F0 = p0(λ1 + (∂g1/∂p0)λ2 + ⋯ + (∂gs/∂p0)λs+1) − u0 = 0
 ⋯
 Fn = pn(λ1 + (∂g1/∂pn)λ2 + ⋯ + (∂gs/∂pn)λs+1) − un = 0
 Fn+1 = g1(p0, …, pn) = 0
 ⋯
 Fn+s = gs(p0, …, pn) = 0
 Fn+s+1 = p0 + ⋯ + pn − 1 = 0

where g1, …, gs are the model invariants of X and p0, …, pn, u0, …, un, λ1, …, λs+1 are indeterminates (also denoted by p, u, and Λ). More specifically,

–  p0, …, pn and λ1, …, λs+1 are unknowns,

–  u0, …, un are parameters.

The affine variety Va(F0, …, Fn+s+1), namely the set

 {(u,p,Λ)∈Cn+1×Cn+1×Cs+1|F0=0,…,Fn+s+1=0},

is said to be the Lagrange likelihood correspondence of X and is denoted by LX.

###### Notation 3.

Let π denote the canonical projection from the ambient space of the Lagrange likelihood correspondence to the coordinates associated to the indeterminates u: π(u, p, Λ) = u.

Given an algebraic statistical model X and a data vector u, the maximum likelihood estimation (MLE) problem is to maximize the likelihood function subject to p ∈ X. The MLE problem can be solved by computing the fiber π−1(u) ∩ LX. More specifically, if p is a regular point of X, then p is a critical point of the likelihood function if and only if there exists Λ such that (u, p, Λ) ∈ LX. Theorem 1 states that for a general data vector u, π−1(u) ∩ LX is a finite set whose cardinality is constant over a dense Zariski open set; this inspires the definition of the ML-degree. For details, see [15].

###### Theorem 1.

[15] For an algebraic statistical model X, there exist an affine variety V ⊊ Cn+1 and a non-negative integer N such that for any u ∈ Cn+1 ∖ V,

 #π−1(u)∩LX=N.
###### Definition 3.

[15] (ML-Degree) For an algebraic statistical model X, the non-negative integer N stated in Theorem 1 is said to be the ML-degree of X.

###### Notation 4.

For any subset S ⊂ Cn+1, I(S) denotes the ideal

 {D∈Q[u] | D(a0,…,an)=0, ∀(a0,…,an)∈S}.

The affine closure of S in Cn+1 is Va(I(S)).

###### Definition 4.

For an algebraic statistical model X, suppose F0, …, Fn+s+1 are defined as in Definition 2. Let J denote the Jacobian determinant of F0, …, Fn+s+1 with respect to the unknowns:

 J = det ∂(F0, …, Fn+s+1)/∂(p0, …, pn, λ1, …, λs+1).

Then, we have the following:

•  LX∞ denotes the set of non-properness of π, i.e., the set of u∗ ∈ Cn+1 such that there does not exist a compact neighborhood U of u∗ for which π−1(U) ∩ LX is compact;

•  LXp denotes the closure of π(LX ∩ Va(p0⋯pn));

•  LXJ denotes the closure of π(LX ∩ Va(J)).

The geometric meanings of LXp and LXJ are as follows. The first, LXp, is the closure of the projection of the intersection of the Lagrange likelihood correspondence with the coordinate hyperplanes pk = 0. The second, LXJ, is the closure of the projection of the intersection of the Lagrange likelihood correspondence with the hypersurface defined by J = 0. Geometrically, LXJ is the closure of the union of the projection of the singular locus of LX and the set of critical values of the restriction of π to the regular locus of LX (see Definition 2 in [20]).

The Lagrange likelihood equations define an affine variety. As we continuously deform the parameters u, coordinates of a solution can tend to infinity. Geometrically, LX∞ is the set of data u for which the Lagrange likelihood equations have a solution at infinity; it is the closure of the set of “non-properness” of π as defined on page 1 of [19] and page 3 of [23]. It is known that the set of non-properness of π is closed and can be computed by Gröbner bases (see Lemma 2 and Theorem 2 in [20]).

The ML-degree captures the geometry of the likelihood equations over the complex numbers. However, statistically meaningful solutions occur over the real numbers. Theorem 2 below states that LX∞, LXJ, and LXp determine open connected components of the data space such that the number of real/positive solutions is uniform over each component. Theorem 2 is a corollary of Ehresmann’s theorem, for which semi-algebraic statements have existed since 1992 [5].

###### Theorem 2.

For any algebraic statistical model X:

•  if D1, …, Dr are the open connected components of

 Rn+1∖(LX∞∪LXJ),

then for each k (1 ≤ k ≤ r) and for any u ∈ Dk,

 #π−1(u)∩LX∩Rn+s+2

is a constant;

•  if D1, …, Dr are the open connected components of

 Rn+1∖(LX∞∪LXJ∪LXp),

then for each k (1 ≤ k ≤ r) and for any u ∈ Dk,

 #π−1(u)∩LX∩(Rn+1>0×Rs+1)

is a constant.

Before we give the definition of the data-discriminant, we study the structures of LX∞, LXJ, and LXp below.

• Proposition 1 shows that the structure of the likelihood equations forces LXp to be contained in the union of the coordinate hyperplanes uk = 0.

• Proposition 2 shows that the structure of the likelihood equations forces LXJ to be (the cone over) a projective variety.

• Similarly to the proof of Proposition 2, one can show that the structure of the likelihood equations forces LX∞ to be (the cone over) a projective variety.

###### Proposition 1.

For any algebraic statistical model X,

 LXp ⊂ Va(u0u1⋯un).

Proof. By Definition 2, for each k (0 ≤ k ≤ n),

 uk = pk(λ1 + (∂g1/∂pk)λ2 + ⋯ + (∂gs/∂pk)λs+1) − Fk.

Hence,

 uk ∈ ⟨Fk,pk⟩∩Q[uk] ⊂ ⟨F0,…,Fn+s+1,pk⟩∩C[u].

So

 Va(⟨F0,…,Fn+s+1,pk⟩∩C[u])⊂Va(uk)

By the Closure Theorem [6], Va(⟨F0,…,Fn+s+1,pk⟩∩C[u]) is the Zariski closure of π(LX∩Va(pk)), written cl(π(LX∩Va(pk))) below. Therefore,

 LXp = cl(π(LX∩Va(p0⋯pn)))
     = cl(π(LX∩(Va(p0)∪⋯∪Va(pn))))
     = cl(π(LX∩Va(p0))) ∪ ⋯ ∪ cl(π(LX∩Va(pn)))
     ⊂ Va(u0) ∪ ⋯ ∪ Va(un)
     = Va(u0u1⋯un). □
###### Remark 1.

Generally, LXp ⊊ Va(u0u1⋯un); the containment in Proposition 1 can be strict.

###### Notation 5.

For indeterminates x0, …, xn, the notation x0x1⋯xn denotes the product of all of them.

###### Proposition 2.

For an algebraic statistical model X, LXJ ∪ {0} is the affine cone over a projective variety in Pn, where 0 is the zero vector in Cn+1.

Proof. By the formulation of the Lagrange likelihood equations, we can prove that I(LXJ) is a homogeneous ideal using the two basic facts below, which follow from Definition 2 and basic algebraic geometry arguments.

C1. For every u in LXJ, each scalar multiple λu (λ ∈ C) is also in LXJ.

C2. For any S ⊂ Cn+1, if for every u ∈ S and every scalar λ ∈ C we have λu ∈ S, then I(S) is a homogeneous ideal in Q[u].

That means the ideal I(LXJ) is generated by finitely many homogeneous polynomials H1, …, Hm. Therefore, the affine closure of LXJ is Va(H1, …, Hm), a cone through the origin, and the claim follows with the projective variety V(H1, …, Hm). □
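Fact C1 can be sanity-checked numerically on the die example from Section 1: scaling the data and the multipliers by the same scalar s maps solutions to solutions, since F0, …, F3 scale by s while F4 and F5 are unchanged. The snippet below is our own toy check (the solution point is built by hand, not computed by any solver).

```python
C = [1, 2, 3, -4]  # weights in the model invariant of the die example

def residuals(u, p, l1, l2):
    """Values of F0, ..., F5 from Section 1 at (u, p, lambda1, lambda2)."""
    F = [p[k] * l1 + C[k] * p[k] * l2 - u[k] for k in range(4)]  # F0..F3
    F.append(sum(C[k] * p[k] for k in range(4)))                 # F4
    F.append(sum(p) - 1)                                         # F5
    return F

# A hand-built point of the Lagrange likelihood correspondence:
# p satisfies F4 = F5 = 0, and u is *defined* so that F0 = ... = F3 vanish.
p = [0.3, 0.3, 0.1, 0.3]
l1, l2 = 2.0, 0.25
u = [p[k] * (l1 + C[k] * l2) for k in range(4)]

assert max(abs(r) for r in residuals(u, p, l1, l2)) < 1e-12

# Fact C1: (s*u, p, s*lambda1, s*lambda2) is again a solution for any scalar s.
for s in (2.0, -3.0, 0.5):
    su = [s * x for x in u]
    assert max(abs(r) for r in residuals(su, p, s * l1, s * l2)) < 1e-10
print("C1 scaling check passed")
```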

###### Notation 6.

For an algebraic statistical model X, we define DXJ according to the codimension of the affine closure of LXJ in Cn+1.

• If the codimension is 1, then assume V1, …, Vt are the codimension-1 irreducible components in the minimal irreducible decomposition of the affine closure of LXJ, and let D1, …, Dt be homogeneous polynomials such that ⟨D1⟩ = I(V1), …, ⟨Dt⟩ = I(Vt) are radical. DXJ denotes the homogeneous polynomial D1⋯Dt.

• If the codimension is greater than 1, then our convention is to take DXJ = 1.

Similarly, we use the notations DX∞ and DXp for the corresponding polynomials attached to LX∞ and LXp. Now we define the “data-discriminant” of the Lagrange likelihood equations.

###### Definition 5.

(Data-Discriminant) For a given algebraic statistical model X, the homogeneous polynomial DX∞·DXJ·DXp is said to be the data-discriminant (DD) of the Lagrange likelihood equations of X and is denoted by DX.

###### Remark 2.

Note that the DD can be viewed as a generalization of the discriminant for univariate polynomials, so it is interesting to compare the DD with the border polynomial (BP) [26] and the discriminant variety (DV) [20]. DV and BP are defined for general parametric polynomial systems; the DD is defined for the likelihood equations but can be generalized to any square, generically zero-dimensional system. Generally, for such a system, the variety defined by the DD is contained in the DV, which is in turn contained in the variety defined by the BP. Note that due to the special structure of the likelihood equations, the DD is a homogeneous polynomial even though the system of equations is affine. However, generally, the DV is not a projective variety and the BP is not homogeneous.

Example 1 (Linear Model). The algebraic statistical model for the four-sided die story in Section 1 is given by

 X=V(p0+2p1+3p2−4p3)∩Δ3.

The Lagrange likelihood equations are those shown in Section 1. The Lagrange likelihood correspondence LX is Va(F0, …, F5). If we choose generic u, then #π−1(u)∩LX = 3; namely, the ML-degree is 3. The data-discriminant DX is the product of DX∞, DXJ, and DXp; we omit the explicit polynomials here.

By applying the well-known partial cylindrical algebraic decomposition (PCAD) [4] method to the data-discriminant above, we get that for any u with DX ≠ 0,

• if DX > 0, then the system of likelihood equations has 3 distinct real solutions and 1 of them is positive;

• if DX < 0, then the system of likelihood equations has exactly 1 real solution and it is positive.

The answer above can be verified by the RealRootClassification [26, 3] command in Maple 17. In this example, DX∞ does not affect the number of real/positive solutions, since it is always positive when each uk is positive. However, in general, DX∞ plays an important role in real root classification. Also remark that real root classification is equivalent to positive root classification for this example, but that is not true in general (see Example 6).
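The two cases can be reproduced numerically with the cubic reduction of the die system used earlier (our construction, not the paper's code: λ1 = u0+u1+u2+u3, pk = uk/(λ1 + ckλ2), and the cubic below is the cleared form of F4 = 0). The sign of the cubic's discriminant separates the two cases, in line with the classical rule for real cubics, and for the all-positive sample data below there is always exactly one positive solution.

```python
from functools import reduce

C = [1, 2, 3, -4]

def cubic(u):
    """Ascending integer coefficients of the cubic in the multiplier lambda2."""
    N = sum(u)
    total = [0, 0, 0, 0]
    for i in range(4):
        poly = [C[i] * u[i]]
        for j in range(4):
            if j != i:  # multiply by (N + C[j]*t)
                nxt = [0] * (len(poly) + 1)
                for k, c in enumerate(poly):
                    nxt[k] += c * N
                    nxt[k + 1] += c * C[j]
                poly = nxt
        for k, c in enumerate(poly):
            total[k] += c
    return total

def disc(a0, a1, a2, a3):
    """Discriminant of a3*t^3 + a2*t^2 + a1*t + a0 (exact integer)."""
    return (18 * a3 * a2 * a1 * a0 - 4 * a2 ** 3 * a0 + a2 ** 2 * a1 ** 2
            - 4 * a3 * a1 ** 3 - 27 * a3 ** 2 * a0 ** 2)

def real_roots(coeffs):
    """Real roots, via the Durand-Kerner iteration on all complex roots."""
    monic = [complex(c) / coeffs[-1] for c in coeffs]
    p = lambda x: sum(c * x ** k for k, c in enumerate(monic))
    zs = [(0.4 + 0.9j) ** k for k in range(1, len(coeffs))]
    for _ in range(200):
        zs = [z - p(z) / reduce(lambda acc, w: acc * (z - w),
                                (w for w in zs if w is not z), 1 + 0j)
              for z in zs]
    return [z.real for z in zs if abs(z.imag) < 1e-7]

for u in ([1, 1, 1, 1], [5, 1, 1, 1], [1, 9, 1, 1], [2, 1, 7, 4], [3, -1, 2, 1]):
    a = cubic(u)
    rs = real_roots(a)
    N = sum(u)
    pos = sum(1 for t in rs
              if all(u[k] / (N + C[k] * t) > 0 for k in range(4)))
    sign = "+" if disc(*a) > 0 else "-"
    print(u, "disc sign:", sign, "| real:", len(rs), "| positive:", pos)
```

The last sample has a negative entry, so it is not statistical data; it is included only to exhibit the negative-discriminant branch with a single real solution.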

## 3 Algorithm

In this section, we discuss how to compute the data-discriminant. We assume that X is a given statistical model, F0, …, Fn+s+1 are defined as in Definition 2, and J is defined as in Definition 4. For brevity, we write m = n + s + 1. Subsection 3.1 presents the standard elimination algorithm for reference, and Subsection 3.2 presents our main algorithm (Algorithm 2).

### 3.1 Standard Elimination Algorithm

Viewing the data-discriminant as a projection suggests a natural algorithm for computing it. This is the standard elimination algorithm in symbolic computation:

• we compute I(LXJ) by elimination and then get DXJ by radical equidimensional decomposition (see Definition 3 in [20]); this is formally described in Algorithm 1;

• we compute LX∞ by the algorithm PROPERNESSDEFECTS presented in [20] and then get DX∞ by radical equidimensional decomposition. We omit the formal description of this algorithm.

The algorithms in this subsection cannot compute the DDs of algebraic statistical models in a reasonable time; see Tables 1–2 in Section 4. This motivates the more practical method in the next subsection.
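To give a feel for what the elimination step computes at a small scale, the toy sketch below (ours, in exact rational arithmetic; not the paper's Algorithm 1) eliminates t between a cubic f and its derivative f′ via the Sylvester resultant, and checks the classical identity Res(f, f′) = −a3·disc(f) for cubics on the cubic arising from the die example's reduction at u = (1, 1, 1, 1).

```python
from fractions import Fraction

def sylvester(p, q):
    """Sylvester matrix of p (degree m) and q (degree n); coefficients descending."""
    m, n = len(p) - 1, len(q) - 1
    M = [[Fraction(0)] * (m + n) for _ in range(m + n)]
    for i in range(n):                     # n shifted copies of p
        for j, c in enumerate(p):
            M[i][i + j] = Fraction(c)
    for i in range(m):                     # m shifted copies of q
        for j, c in enumerate(q):
            M[n + i][i + j] = Fraction(c)
    return M

def det(M):
    """Exact determinant by Gaussian elimination over the rationals."""
    M = [row[:] for row in M]
    n, d = len(M), Fraction(1)
    for col in range(n):
        piv = next((r for r in range(col, n) if M[r][col] != 0), None)
        if piv is None:
            return Fraction(0)
        if piv != col:
            M[col], M[piv] = M[piv], M[col]
            d = -d
        d *= M[col][col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c2 in range(col, n):
                M[r][c2] -= f * M[col][c2]
    return d

def disc_cubic(a3, a2, a1, a0):
    return (18 * a3 * a2 * a1 * a0 - 4 * a2 ** 3 * a0 + a2 ** 2 * a1 ** 2
            - 4 * a3 * a1 ** 3 - 27 * a3 ** 2 * a0 ** 2)

# Cubic from the die example's reduction at u = (1,1,1,1), descending coeffs:
# f(t) = -96 t^3 - 456 t^2 - 416 t + 128.
f = [Fraction(c) for c in (-96, -456, -416, 128)]
fp = [3 * f[0], 2 * f[1], f[2]]            # f'(t), descending
res = det(sylvester(f, fp))
print(res == -f[0] * disc_cubic(*f))       # -> True
```

Real elimination for DDs works with many variables and parametric coefficients, which is exactly why Gröbner-basis elimination becomes expensive; this determinant-sized example only illustrates the projection being computed.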

### 3.2 Probabilistic Algorithm

First we prepare the lemmas; then we present the main algorithm (Algorithm 2).

• Lemma 1 is used to linearly transform the parameter space.

• Corollary 1 and Lemma 2 are used to compute the total degree of DXJ.

• Corollary 2 is used in the sampling for interpolation.

###### Lemma 1.

For any G ∈ Q[u0, …, un], there exists an affine variety V ⊊ Cn such that for any (a1, …, an) ∈ Cn ∖ V, the total degree of G equals the degree of B w.r.t. t0, where

 B(t0,t1,…,tn)=G(t0,a1t0+t1,…,ant0+tn)

Proof. Suppose the total degree of G is d and Gd is the homogeneous component of G of total degree d. For any (a1, …, an), the coefficient of t0^d in B is Gd(1, a1, …, an). It is easily seen that the degree of B w.r.t. t0 equals d if and only if Gd(1, a1, …, an) ≠ 0, so we can take V = Va(Gd(1, u1, …, un)). □

###### Corollary 1.

For any G ∈ Q[u0, …, un], there exists an affine variety V in C2n+2 such that for any

 (a0,b0,…,an,bn)∈C2n+2∖V,

the total degree of G equals the degree of B w.r.t. t, where

 B(t)=G(a0t+b0,…,ant+bn).
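Corollary 1 can be seen in action on a toy surrogate for DXJ: take D(u) to be the discriminant of the cubic from the die example's reduction (our construction; it is homogeneous of degree 10 in u, since the coefficient of the k-th power of the multiplier has degree 4 − k). Restricting D to a line u = at + b and interpolating exactly recovers the total degree, and the leading coefficient of the restriction is D(a).

```python
from fractions import Fraction

C = [1, 2, 3, -4]

def cubic(u):
    """Ascending coefficients of the cubic in the multiplier, exact in u."""
    N = sum(u)
    total = [Fraction(0)] * 4
    for i in range(4):
        poly = [Fraction(C[i]) * u[i]]
        for j in range(4):
            if j != i:  # multiply by (N + C[j]*t)
                nxt = [Fraction(0)] * (len(poly) + 1)
                for k, c in enumerate(poly):
                    nxt[k] += c * N
                    nxt[k + 1] += c * C[j]
                poly = nxt
        for k, c in enumerate(poly):
            total[k] += c
    return total

def D(u):
    """Discriminant of that cubic: a homogeneous polynomial of degree 10 in u."""
    a0, a1, a2, a3 = cubic(u)
    return (18 * a3 * a2 * a1 * a0 - 4 * a2 ** 3 * a0 + a2 ** 2 * a1 ** 2
            - 4 * a3 * a1 ** 3 - 27 * a3 ** 2 * a0 ** 2)

def newton_coeffs(ts, vs):
    """Divided differences f[t0..ti]; the top nonzero one is the leading coeff."""
    ts = [Fraction(t) for t in ts]
    table = [Fraction(v) for v in vs]
    out = [table[0]]
    for level in range(1, len(ts)):
        table = [(table[k + 1] - table[k]) / (ts[k + level] - ts[k])
                 for k in range(len(table) - 1)]
        out.append(table[0])
    return out

a = [1, 1, 1, 1]   # direction of the line; chosen so that D(a) != 0
b = [2, 0, 5, 1]   # base point; both picks are arbitrary "generic" choices
ts = list(range(12))
vs = [D([a[k] * t + b[k] for k in range(4)]) for t in ts]
nc = newton_coeffs(ts, vs)
deg = max(i for i, c in enumerate(nc) if c != 0)
print("total degree recovered from one line:", deg)  # -> 10
```

This is the heart of the probabilistic approach: D is only evaluated at sample points, never expanded symbolically, and the degree information drives how many samples the interpolation needs.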
###### Lemma 2.

There exists an affine variety V in C2n+2 such that for any (a0, b0, …, an, bn) ∈ C2n+2 ∖ V, if

 ⟨A(t)⟩=⟨F0(t),…,Fn(t),Fn+1,…,Fm,J⟩∩Q[t]

where Fk(t) (0 ≤ k ≤ n) is the polynomial obtained from Fk by replacing uk with akt + bk, and

 B(t)=DXJ(a0t+b0,…,ant+bn),

then Va(A(t)) = Va(B(t)).

Proof. By the definition of DXJ (Notation 6), there exists an affine variety V1 such that for any (a0, b0, …, an, bn) ∉ V1, ⟨B(t)⟩ is radical. Thus, we only need to show that there exists an affine variety V2 in C2n+2 such that for any (a0, b0, …, an, bn) ∉ V2, Va(A(t)) = Va(B(t)).

Suppose πt is the canonical projection (t, p, Λ) ↦ t. For any

 t∗∈πt(Va(F0(t),…,Fn(t),Fn+1,…,Fm,J)),

let u∗ = (a0t∗+b0, …, ant∗+bn); then u∗ ∈ LXJ. Hence DXJ(u∗) = 0, and so B(t∗) = 0. Thus

 B(t)∈I(πt(Va(F0(t),…,Fn(t),Fn+1,…,Fm,J))).

Therefore,

 Va(A(t)) = Va(I(πt(Va(F0(t),…,Fn(t),Fn+1,…,Fm,J)))) ⊂ Va(B(t)).

For any t∗ ∈ Va(B(t)), let u∗k = akt∗ + bk for 0 ≤ k ≤ n; then u∗ ∈ Va(DXJ). By the Extension Theorem [6], there exists an affine variety V3 such that if (a0, b0, …, an, bn) ∉ V3, then the partial solution u∗ extends to a point of Va(F0(t),…,Fn(t),Fn+1,…,Fm,J); thus

 t∗∈πt(Va(F0(t),…,Fn(t),Fn+1,…,Fm,J))⊂Va(A(t)). □
###### Corollary 2.

There exists an affine variety V in Cn such that for any (a1, …, an) ∈ Cn ∖ V, if

 ⟨A(u0)⟩=⟨F0,F∗1,…,F∗n,Fn+1,…,Fm,J⟩∩Q[u0]

where F∗k (1 ≤ k ≤ n) is the polynomial obtained from Fk by replacing uk with ak, and

 B(u0)=DXJ(u0,a1,…,an),

then Va(A(u0)) = Va(B(u0)).

We show an example to explain the basic idea of the probabilistic algorithm and how the lemmas work in the algorithm.

Example 2 (Toy Example for Interpolation Idea). Suppose the radical of the elimination ideal is generated by , where and . We already know that is homogenous and equals . Rather than by the standard elimination algorithm, we compute