
# Improved estimation in cumulative link models

Ioannis Kosmidis
Department of Statistical Science, University College London
London, WC1E 6BT, UK
i.kosmidis@ucl.ac.uk
###### Abstract

For the estimation of cumulative link models for ordinal data, the bias-reducing adjusted score equations in Firth (1993) are obtained, whose solution ensures an estimator with smaller asymptotic bias than the maximum likelihood estimator. Their form suggests a parameter-dependent adjustment of the multinomial counts, which, in turn, suggests the solution of the adjusted score equations through iterated maximum likelihood fits on adjusted counts, greatly facilitating implementation. Like the maximum likelihood estimator, the reduced-bias estimator is found to respect the invariance properties that make cumulative link models a good choice for the analysis of categorical data. Its additional finiteness and optimal frequentist properties, along with the adequate behaviour of related asymptotic inferential procedures, make the reduced-bias estimator attractive as a default choice for practical applications. Furthermore, the proposed estimator enjoys certain shrinkage properties that are defensible from an experimental point of view relating to the nature of ordinal data.
Key words: reduction of bias, adjusted score equations, adjusted counts, shrinkage, ordinal response models.

## 1 Introduction

In many models with categorical responses the maximum likelihood estimates can be on the boundary of the parameter space with positive probability. For example, Albert and Anderson (1984) derive the conditions that describe when the maximum likelihood estimates are on the boundary in multinomial logistic regression models. While there is no ambiguity in reporting an estimate on the boundary of the parameter space, as is for example an infinite estimate for the parameters of a logistic regression model, estimates on the boundary can (i) cause numerical instabilities to fitting procedures, (ii) lead to misleading output when estimation is based on iterative procedures with a stopping criterion, and more importantly, (iii) cause havoc to asymptotic inferential procedures, and especially to the ones that depend on estimates of the standard error of the estimators (for example, Wald tests and related confidence intervals).

The maximum likelihood estimator in cumulative link models for ordinal data (McCullagh, 1980) also has a positive probability of being on the boundary.

###### Example 1.1:

As a demonstration consider the example in Christensen (2012a, §7). The data set in Table 1 comes from Randall (1989) and concerns a factorial experiment for investigating factors that affect the bitterness of white wine. There are two factors in the experiment: temperature at the time of crushing the grapes (with two levels, “cold” and “warm”) and contact of the juice with the skin (with two levels, “Yes” and “No”). For each combination of factors, two bottles were rated on their bitterness by a panel of 9 judges. The responses of the judges on the bitterness of the wine were taken on a continuous scale in the interval from 0 (“None”) to 100 (“Intense”) and were then grouped into five ordered categories.

Consider the partial proportional odds model (Peterson and Harrell, 1990) with

$$\log\frac{\gamma_{rs}}{1-\gamma_{rs}}=\alpha_s-\beta_s w_r-\theta z_r \qquad (r=1,\dots,4;\ s=1,\dots,4), \tag{1}$$

where $w_r$ and $z_r$ are dummy variables representing the factors temperature and contact, respectively, $\alpha_s$, $\beta_s$ and $\theta$ are model parameters, and $\gamma_{rs}$ is the cumulative probability for the $s$th category at the $r$th combination of levels for temperature and contact. The clm function of the R package ordinal (Christensen, 2012b) is used to maximize the multinomial likelihood that corresponds to model (1). The clm function finds the maximum likelihood estimates up to a specified accuracy, using a Newton-Raphson iteration to find the roots of the log-likelihood derivatives. For the current example, the stopping criterion is that the log-likelihood derivatives are smaller in absolute value than a specified tolerance. The maximum likelihood estimates, estimated standard errors and the corresponding values of the statistics for the hypothesis that the respective parameter is zero are extracted from the software output and shown in Table 1. At those values of the maximum likelihood estimator the maximum absolute log-likelihood derivative is below the tolerance and the software correctly reports convergence. Nevertheless, an immediate observation is that the estimates and estimated standard errors for three of the parameters are very large in absolute value. These would in fact diverge to infinity as the stopping criteria of the iterative fitting procedure become stricter and the number of allowed iterations increases.

For model (1) interest usually lies in testing departures from the assumption of proportional odds via the hypothesis that the category-specific temperature parameters are all equal. Using a Wald-type statistic would be adventurous here, because such a statistic depends explicitly on the estimates of those parameters. Of course, given that the likelihood is close to its maximal value at the estimates in Table 1, a likelihood ratio test can be used instead; the likelihood ratio test for this particular example has been carried out in Christensen (2012a, §7).

Furthermore, the current example demonstrates some of the potential dangers involved in the application of cumulative link models in general; the behaviour of the individual statistics, which are essentially zero for the diverging parameters in this example, is quite typical of what happens when estimates diverge to infinity. The values of the statistics converge to zero because the estimated standard errors diverge much faster than the estimates, irrespective of whether or not there is evidence against the individual hypotheses. The same behaviour occurs for individual hypotheses at values other than zero, and can lead to invalid conclusions if the output is interpreted naively. More importantly, the presence of three infinite standard errors in a non-orthogonal setting (in the sense of Cox and Reid, 1987) like the current one may affect the estimates of the standard errors for the other parameters in ways that are hard to predict. ∎
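The collapse of the Wald statistic can be reproduced in a few lines of code. The sketch below (plain Python with numpy; the function name and the single-parameter setting are illustrative, not taken from the paper) runs Newton-Raphson for the log-odds of y = 0 successes out of m = 10 trials, for which the maximum likelihood estimate is minus infinity:

```python
import numpy as np

def newton_logodds(y, m, steps):
    # Newton-Raphson for the log-odds theta of y successes out of m trials;
    # with y = 0 the MLE is -infinity and the iterates decrease without bound
    theta = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-theta))
        theta += (y - m * p) / (m * p * (1.0 - p))
    p = 1.0 / (1.0 + np.exp(-theta))
    se = 1.0 / np.sqrt(m * p * (1.0 - p))   # inverse-information standard error
    return theta, se, theta / se            # estimate, std. error, Wald z
```

With y = 0 and m = 10, stopping after 10 iterations and then after 20 gives increasingly extreme estimates, but the estimated standard error grows much faster, so the Wald z-statistic shrinks towards zero regardless of the evidence in the data.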

An apparent solution to the issues mentioned in Example 1.1 is to use a different estimator that has probability zero of resulting in estimates on the boundary of the parameter space. For example, for the estimation of the common difference in cumulative logits from ordinal data arranged in a contingency table with fixed row totals, McCullagh (1980) describes the generalized empirical logistic transform. The generalized empirical logistic transform has smaller asymptotic bias than the maximum likelihood estimator and is also guaranteed to give finite estimates of the difference in cumulative logits because it adjusts all cumulative counts by . However, the applicability of this estimator is limited to the analysis of tables, and particularly in estimating differences in cumulative logits, with no obvious extension to more general cumulative link models, such as the one in Example 1.1.

A family of estimators that can be used for arbitrary cumulative link models and are guaranteed to be finite are ridge estimators. As one of the referees highlighted, if one extends the work in le Cessie and van Houwelingen (1992) for logistic regression to cumulative link models, then the shrinkage properties of the ridge estimator can guarantee its finiteness. Nevertheless, ridge estimators have a series of shortcomings. Firstly, in contrast to the maximum likelihood estimator, ridge estimators are not generally equivariant under linear transformations of the parameters of cumulative link models. A reparameterization of the model by re-scaling the parameters or taking contrasts of them (often interesting transformations in cumulative link models) and a corresponding transformation of the ridge estimator will not generally result in the estimator that the same ridge penalty would produce for the reparameterized model, unless the penalty is also appropriately adjusted. For example, if one wishes to test the hypothesis of proportional odds in Example 1.1 using a Wald test, then an appropriate ridge estimator would be one that penalizes the size of the contrasts of the category-specific parameters instead of the size of those parameters themselves. Secondly, the choice of the tuning parameter is usually performed through an optimality criterion and cross-validation, so the properties of the resultant estimators are sensitive to the choice of the criterion. For example, criteria like the mean-squared error of predictions, the classification error and the log-likelihood, all discussed in le Cessie and van Houwelingen (1992), will each produce different results, as is also shown in the latter study. Furthermore, the resultant ridge estimator is sensitive to the type of cross-validation used; for example, cross-validation with different numbers of folds will produce different results.
Lastly, standard asymptotic inferential procedures for performing hypothesis tests and constructing confidence intervals cannot be used by simply replacing the maximum likelihood estimator with the ridge estimator in the associated pivots. For the above reasons, ridge estimators can only offer an ad-hoc solution to the problem.

Several simulation studies on widely used models for discrete responses have demonstrated that bias reduction via the adjustment of the log-likelihood derivatives (Firth, 1993) offers a solution to the problems relating to boundary estimates; see, for example, Mehrabi and Matthews (1995) for the estimation of simple complementary log-log models, Heinze and Schemper (2002) for binomial logistic regression, Bull et al. (2002) and Kosmidis and Firth (2011) for multinomial logistic regression, and Kosmidis (2009) for binomial-response generalized linear models.

In the current paper the aforementioned adjustment is derived and evaluated for the estimation of cumulative link models for ordinal responses. It is shown that reduction of bias is equivalent to a parameter-dependent additive adjustment of the multinomial counts and that such adjustment generalizes well-known constant adjustments in cases like the estimation of cumulative logits. Then, the reduced-bias estimates can be obtained through iterative maximum likelihood fits to the adjusted counts. The form of the parameter-dependent adjustment is also used to show that, like the maximum likelihood estimator, the reduced-bias estimator is invariant to the level of sample aggregation present in the data.

Furthermore, it is shown that the reduced-bias estimator respects the invariance properties that make cumulative link models an attractive choice for the analysis of ordinal data. The finiteness and shrinkage properties of the proposed estimator are illustrated via detailed complete enumeration and an extensive simulation exercise. In particular, the reduced-bias estimator is found to be always finite, and also the reduction of bias in cumulative link models results in the shrinkage of the multinomial model towards a smaller binomial model for the end-categories. A thorough discussion on the desirable frequentist properties of the estimator is provided along with an investigation of the performance of associated inferential procedures.

The finiteness of the reduced-bias estimator, its optimal frequentist properties and the adequate performance of the associated inferential procedures lead to its proposal for routine use in fitting cumulative link models.

The exposition of the methodology is accompanied by a parallel discussion of the corresponding implications in the application of the models through examples with artificial and real data.

## 2 Cumulative link models

Suppose $n$ observations on mutually independent multinomial random vectors of counts, where the multinomial categories are ordered. The multinomial total for the $r$th vector is $m_r$ and the probability of the $s$th category of the $r$th multinomial vector is $\pi_{rs}$ $(s=1,\dots,q+1;\ r=1,\dots,n)$, with $\sum_{s=1}^{q+1}\pi_{rs}=1$. The cumulative link model links the cumulative probability $\gamma_{rs}=\pi_{r1}+\dots+\pi_{rs}$ to a $p$-vector of covariates via the relationship

$$\gamma_{rs}=G\!\left(\alpha_s-\sum_{t=1}^{p}\beta_t x_{rt}\right) \qquad (s=1,\dots,q;\ r=1,\dots,n), \tag{2}$$

where $q+1$ is the number of response categories, $\alpha_1<\dots<\alpha_q$ are cutpoint parameters and $\beta_1,\dots,\beta_p$ are real-valued regression parameters. The function $G$ is a monotone increasing function mapping the real line to $(0,1)$, usually chosen to be a known distribution function (like, for example, the logistic, extreme value or standard normal distribution function). Then $\alpha_1,\dots,\alpha_q$ can be considered as cutpoints on the latent scale implied by $G$.
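As a concrete illustration, the mapping from cutpoints and covariates to category probabilities can be sketched numerically. The code below (plain Python with numpy; the function and variable names are illustrative, not the paper's code) uses the logistic choice for $G$ and differences consecutive cumulative probabilities:

```python
import numpy as np

def category_probs(alpha, beta, x):
    # cumulative probabilities gamma_s = G(alpha_s - x' beta), logistic G,
    # padded with gamma_0 = 0 and gamma_{q+1} = 1
    eta = np.asarray(alpha, dtype=float) - np.dot(x, beta)
    gamma = np.concatenate(([0.0], 1.0 / (1.0 + np.exp(-eta)), [1.0]))
    return np.diff(gamma)   # pi_s = gamma_s - gamma_{s-1}, summing to one
```

With increasing cutpoints, for example `category_probs([-1.0, 0.0, 1.5], [0.8], [1.0])`, the result is a vector of four positive category probabilities that sums to one.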

Important special cases of cumulative link models are the proportional-odds model, where $G$ is the logistic distribution function, and the proportional hazards model in discrete time, where $G$ is the extreme value distribution function (see McCullagh, 1980, for the introduction of and a thorough discussion of cumulative link models).

The cumulative link model can be written in the usual multivariate generalized linear models form by writing the relationship that links the cumulative probability to as

$$G^{-1}(\gamma_{rs})=\eta_{rs}=\sum_{t=1}^{p+q}\delta_t z_{rst} \qquad (s=1,\dots,q;\ r=1,\dots,n), \tag{3}$$

where is the th component of the matrix

In order to be able to identify , the matrix with row blocks is assumed to be of full rank.

Direct differentiation of the multinomial log-likelihood gives that the th component of the vector of score functions has the form

$$U_t(\delta)=\sum_{r=1}^{n}\sum_{s=1}^{q}g_{rs}(\delta)\left(\frac{y_{rs}}{\pi_{rs}(\delta)}-\frac{y_{r,s+1}}{\pi_{r,s+1}(\delta)}\right)z_{rst} \qquad (t=1,\dots,p+q), \tag{4}$$

where $g_{rs}(\delta)=g(\eta_{rs})$, with $g$ the derivative of $G$. If $g$ is log-concave then the score equations have as unique solution the maximum likelihood estimate (see Pratt, 1981, where it is shown that the log-concavity of $g(\cdot)$ implies the concavity of the log-likelihood).
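The form of the score functions can be checked numerically. The following sketch (a hypothetical setup: one multinomial observation with three ordered categories, logistic $G$ and cutpoints only) compares formula (4) with a finite-difference derivative of the multinomial log-likelihood:

```python
import numpy as np

y = np.array([3.0, 5.0, 2.0])                  # counts in three ordered categories

def cdf(x): return 1.0 / (1.0 + np.exp(-x))    # logistic G
def pdf(x): c = cdf(x); return c * (1.0 - c)   # g = G'

def probs(alpha):
    gamma = np.concatenate(([0.0], cdf(np.asarray(alpha)), [1.0]))
    return np.diff(gamma)

def loglik(alpha):
    return float(y @ np.log(probs(alpha)))

def score(alpha):
    # formula (4) with z the identity (cutpoints only, q = 2)
    pi, g = probs(alpha), pdf(np.asarray(alpha))
    return np.array([g[s] * (y[s] / pi[s] - y[s + 1] / pi[s + 1]) for s in range(2)])

alpha = np.array([-0.4, 0.9])
eps = 1e-6
num = np.array([(loglik(alpha + eps * e) - loglik(alpha - eps * e)) / (2 * eps)
                for e in np.eye(2)])
# score(alpha) and num agree to several decimal places
```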

All generalized linear models for binomial responses that include an intercept parameter in the linear predictor are special cases of model (2).

## 3 Maximum likelihood estimates on the boundary

The maximum likelihood estimates of the parameters of the cumulative link model can be on the boundary of the parameter space with positive probability. Under the log-concavity of , Haberman (1980) gives conditions that guarantee that the maximum likelihood estimates are not on the boundary (“exist” in an alternative terminology). Boundary estimates for these models are estimates of the regression parameters with an infinite value, and/or estimates of the cutpoints for which at least a pair of consecutive cutpoints have equal estimated value.

As far as the regression parameters are concerned, Agresti (2010, § 3.4.5) gives an accessible account on what data settings result in infinite estimates for the regression parameters, how a fitted model with such estimates can be used for inference and how such problems can be identified from the output of standard statistical software.

As far as boundary values of the cutpoints are concerned, Pratt (1981) showed that with a log-concave $g$, two consecutive cutpoints have equal estimates if and only if the observed counts for the category between them are zero for all multinomial vectors. If the first or the last category has zero counts throughout, then the respective estimate of the first or last cutpoint will be $-\infty$ or $+\infty$, respectively, and if zero counts occur throughout for a middle category, then the estimates of the two adjacent cutpoints will have the same finite value.

## 4 Bias correction and bias reduction

### 4.1 Adjusted score functions and first-order bias

Denote by $b(\delta)$ the first term in the asymptotic expansion of the bias of the maximum likelihood estimator in decreasing orders of information, usually the sample size. Call $b(\delta)$ the first-order bias term, and let $F(\delta)$ denote the expected information matrix for $\delta$. Firth (1993) showed that if $A(\delta)=-F(\delta)b(\delta)$, then the solution of the adjusted score equations

$$U^*_t(\delta)=U_t(\delta)+A_t(\delta)=0 \qquad (t=1,\dots,p+q), \tag{5}$$

results in an estimator that is free from the first-order term in the asymptotic expansion of its bias.

### 4.2 Reduced-bias estimator

Kosmidis and Firth (2009) exploited the structure of the bias-reducing adjusted score functions in (5) in the case of exponential family non-linear models. Using Kosmidis and Firth (2009, expression (9)) for the adjusted score functions in the case of multivariate generalized linear models, and temporarily omitting the argument from the quantities that depend on it, the adjustment functions in (5) have the form

$$A_t=\frac{1}{2}\sum_{r=1}^{n}m_r\sum_{s=1}^{q}\mathrm{tr}\!\left[V_r\left\{\left(D_r\Sigma_r^{-1}\right)_s\otimes 1_q\right\}D_2(\pi_r;\eta_r)\right]z_{rst} \qquad (t=1,\dots,p+q), \tag{6}$$

where is the asymptotic variance-covariance matrix of the estimator for the vector of predictor functions and . Furthermore, is the matrix with th block the hessian of with respect to , is the identity matrix and is the Jacobian of with respect to . A straightforward calculation shows that

The matrix is the incomplete variance-covariance matrix of the multinomial vector with th component

$$\sigma_{rsu}=\begin{cases}m_r\pi_{rs}(1-\pi_{rs}), & s=u\\[2pt] -m_r\pi_{rs}\pi_{ru}, & s\neq u\end{cases} \qquad (s,u=1,\dots,q;\ r=1,\dots,n).$$

Substituting in (6), some tedious calculation gives that the adjustment functions have the form

$$A_t=\sum_{r=1}^{n}\sum_{s=1}^{q}g_{rs}\left(\frac{c_{rs}-c_{r,s-1}}{\pi_{rs}}-\frac{c_{r,s+1}-c_{rs}}{\pi_{r,s+1}}\right)z_{rst} \qquad (t=1,\dots,p+q), \tag{7}$$

where

with , and . The quantity is the th diagonal component of the matrix .

Substituting (4) and (7) in (5) gives that the th component of the bias-reducing adjusted score vector has the form

$$U^*_t(\delta)=\sum_{r=1}^{n}\sum_{s=1}^{q}g_{rs}(\delta)\left\{\frac{y_{rs}+c_{rs}(\delta)-c_{r,s-1}(\delta)}{\pi_{rs}(\delta)}-\frac{y_{r,s+1}+c_{r,s+1}(\delta)-c_{rs}(\delta)}{\pi_{r,s+1}(\delta)}\right\}z_{rst}. \tag{9}$$

The reduced-bias estimates are such that the adjusted score functions vanish for every component. Kosmidis (2007a, Chapter 6) shows that if the maximum likelihood estimator is consistent, then the reduced-bias estimator is also consistent. Furthermore, the reduced-bias estimator has the same asymptotic distribution as the maximum likelihood estimator, namely a multivariate normal distribution centred at the true parameter vector with variance-covariance matrix the inverse of the expected information. Hence, estimated standard errors for the reduced-bias estimates can be obtained as usual from the square roots of the diagonal elements of the inverse of the Fisher information evaluated at the reduced-bias estimates. All inferential procedures that rely on the asymptotic normality of the estimator can be directly adapted to the reduced-bias estimator.
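A sketch of how the standard Wald machinery transfers to the reduced-bias estimator (illustrative code with hypothetical names; the Fisher information matrix is assumed to be supplied by the fitting routine):

```python
import numpy as np

def wald_intervals(delta_tilde, fisher_info, z=1.959963984540054):
    # z is the standard normal quantile for the desired level (here 95%);
    # standard errors come from the inverse Fisher information evaluated
    # at the reduced-bias estimate delta_tilde
    se = np.sqrt(np.diag(np.linalg.inv(fisher_info)))
    return np.column_stack((delta_tilde - z * se, delta_tilde + z * se))
```

For example, with a diagonal information matrix diag(4, 1) the standard errors are 0.5 and 1, and the intervals are the usual estimate ± 1.96 standard errors.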

### 4.3 Bias-corrected estimator

Expression (7) can also be used to evaluate the first-order bias term as , where . If is the maximum likelihood estimator then

$$\tilde{\delta}_{\mathrm{BC}}=\hat{\delta}-b(\hat{\delta}) \tag{10}$$

is the bias-corrected estimator that has been studied by Cordeiro and McCullagh (1991) for univariate generalized linear models. The estimator can be shown to have no first-order term in the expansion of its bias (see Efron, 1975, for an analytic derivation of this result).

### 4.4 Models for binomial responses

For $q=1$ the multinomial distribution reduces to the binomial: the count for the first category has a binomial distribution with index $m_r$ and probability $\pi_{r1}$. Then model (2) reduces to the univariate generalized linear model

$$G^{-1}(\pi_{r})=\alpha-\sum_{t=1}^{p}\beta_t x_{rt} \qquad (r=1,\dots,n).$$

From (9), the adjusted score functions take the form

$$U^*_t(\delta)=\sum_{r=1}^{n}g_{r1}(\delta)\left\{\frac{y_{r1}+c_{r1}(\delta)}{\pi_{r1}(\delta)}-\frac{m_r-y_{r1}-c_{r1}(\delta)}{1-\pi_{r1}(\delta)}\right\}z_{r1t} \qquad (t=1,\dots,p+1).$$

Omitting the category index for notational simplicity, a re-expression of the above equality gives that the adjusted score functions for binomial generalized linear models have the form

$$U^*_t(\delta)=\sum_{r=1}^{n}\frac{g_{r}}{\pi_{r}(1-\pi_{r})}\left(y_{r}+\frac{g'_{r}}{2w_{r}}h_{r}-m_r\pi_{r}\right)z_{rt} \qquad (t=1,\dots,p+1), \tag{11}$$

where $w_r=m_r g_r^2/\{\pi_r(1-\pi_r)\}$ are the working weights and $h_r$ is the $r$th diagonal component of the “hat” matrix of the working weighted least squares problem, with weight matrix $W=\mathrm{diag}(w_1,\dots,w_n)$ and $g'_r$ the derivative of $g$ evaluated at the $r$th linear predictor.

The above expression agrees with the results in Kosmidis and Firth (2009, §4.3), where it is shown that for generalized linear models reduction of bias via adjusted score functions is equivalent to replacing the actual count $y_r$ with the parameter-dependent adjusted count $y_r+g'_r h_r/(2w_r)$.

## 5 Implementation

### 5.1 Maximum likelihood fits on iteratively adjusted counts

When expression (9) is compared to expression (4), it is directly apparent that bias reduction is equivalent to an additive adjustment of the multinomial counts by parameter-dependent quantities. Because these quantities depend on the model parameters in general, this interpretation of bias reduction can be exploited to set up an iterative scheme with a stationary point at the reduced-bias estimates: at each step, i) evaluate the count adjustments at the current value of the parameters, and ii) fit the original model to the adjusted counts using some standard maximum likelihood routine.

However, the adjustments can take negative values, which in turn may result in fitting the model to negative counts in step ii) above. In principle this is possible, but then the log-concavity of $g$ does not necessarily imply concavity of the log-likelihood and problems may arise in the maximization in step ii) (see, for example, Pratt, 1981, where the transition from the log-concavity of $g$ to the concavity of the log-likelihood requires that the latter be a weighted sum with non-negative weights). This is the reason why many published maximum likelihood fitting routines will complain if supplied with negative counts.

The issue can be remedied through a simple calculation. Temporarily omitting the index $r$, let $a_s=c_s-c_{s-1}$. Then the kernel of the summands in (9) can be re-expressed as

$$\frac{y_{s}+a_{s}I(a_{s}>0)-\pi_{s}a_{s+1}I(a_{s+1}\leq 0)/\pi_{s+1}}{\pi_{s}}-\frac{y_{s+1}+a_{s+1}I(a_{s+1}>0)-\pi_{s+1}a_{s}I(a_{s}\leq 0)/\pi_{s}}{\pi_{s+1}},$$

where $I(E)=1$ if $E$ holds and $I(E)=0$ otherwise. Note that

$$a_{s}(\delta)I(a_{s}(\delta)>0)-\pi_{s}(\delta)\,a_{s+1}(\delta)I(a_{s+1}(\delta)\leq 0)/\pi_{s+1}(\delta)\geq 0,$$

uniformly in $\delta$. Hence, if step i) in the above procedure adjusts each count by the corresponding non-negative quantity above, evaluated at the current value of the parameters, then the possibility of issues relating to negative adjusted counts in step ii) is eliminated, and the resultant iterative procedure still has a stationary point at the reduced-bias estimates.
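For the binomial-logit special case of Subsection 4.4, where the adjustment is known to amount to pseudo-counts $y+h/2$ out of $m+h$ totals, with $h$ the leverages (Kosmidis and Firth, 2009), the scheme of this subsection can be sketched as follows (illustrative Python, not the paper's implementation):

```python
import numpy as np

def ml_fit(X, y, m, beta, steps=30):
    # step ii): plain Newton-Raphson maximum likelihood fit of a binomial-logit
    # model to (possibly fractional) adjusted counts y out of totals m
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = m * p * (1.0 - p)
        beta = beta + np.linalg.solve((X * W[:, None]).T @ X, X.T @ (y - m * p))
    return beta

def reduced_bias_fit(X, y, m, cycles=30):
    beta = np.zeros(X.shape[1])
    for _ in range(cycles):
        # step i): evaluate the adjustments at the current estimate;
        # h are the leverages of the working weighted least squares problem
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = m * p * (1.0 - p)
        XtWX = (X * W[:, None]).T @ X
        h = W * np.sum((X @ np.linalg.inv(XtWX)) * X, axis=1)
        beta = ml_fit(X, y + h / 2.0, m + h, beta)   # ML on adjusted counts
    return beta

# completely separated data: the MLE of the slope is infinite,
# while the reduced-bias fit stays finite
X = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
m = np.ones(4)
```

For this saturated two-group design the leverages equal 1/2 throughout, so the scheme reduces to adding 1/2 to each cell of the underlying 2×2 table and fitting by maximum likelihood.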

### 5.2 Iterative bias correction

Another way to obtain the reduced-bias estimates is via the iterative bias-correction procedure of Kosmidis and Firth (2010); if the current value of the estimates is then the next candidate value is calculated as

$$\delta^{(i+1)}=\hat{\delta}^{(i+1)}-b(\delta^{(i)}) \qquad (i=0,1,\dots), \tag{12}$$

where $\hat{\delta}^{(i+1)}$ is the next candidate value for the maximum likelihood estimator, obtained through a single Fisher scoring step starting from $\delta^{(i)}$.

Iteration (12) generally requires more implementation effort than the iteration described in Subsection 5.1. Nevertheless, if the starting value is chosen to be the maximum likelihood estimates, then the first step of the procedure in (12) results in the bias-corrected estimates defined in (10).
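In the scalar binomial-logit case iteration (12) can be written down explicitly, since the first-order bias of the maximum likelihood estimator of the log-odds is $(\pi-1/2)/\{m\pi(1-\pi)\}$, a standard result; the setup below is an illustrative sketch, not the paper's code. The fixed point solves the adjusted score equation $y+1/2-(m+1)\pi=0$:

```python
import numpy as np

# y successes out of m trials; with y = 0 the MLE of the log-odds is -infinity
y, m = 0.0, 10.0

theta = 0.0
for _ in range(100):
    p = 1.0 / (1.0 + np.exp(-theta))
    info = m * p * (1.0 - p)                    # Fisher information
    bias = (p - 0.5) / info                     # first-order bias b(theta)
    fisher_step = theta + (y - m * p) / info    # single Fisher scoring step
    theta = fisher_step - bias                  # iteration (12)

# the fixed point solves y + 1/2 - (m + 1) * p = 0, i.e.
# theta = log((y + 1/2) / (m - y + 1/2)) = log(1/21) ~= -3.045
```

Note that the iteration converges to a finite value even though the maximum likelihood estimate itself lies on the boundary.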

### 6.1 Estimation of cumulative logits

For the estimation of the cumulative logits from a single multinomial observation with $m$ trials, the maximum likelihood estimator of the $s$th cumulative logit is $\log\{R_s/(m-R_s)\}$, where $R_s$ is the $s$th cumulative count. The Fisher information is the matrix of quadratic weights $W$. The matrix $W$ is symmetric and tri-diagonal with non-zero components

$$W_{ss}=m\gamma_s^2(1-\gamma_s)^2\left(\frac{1}{\gamma_s-\gamma_{s-1}}+\frac{1}{\gamma_{s+1}-\gamma_s}\right) \qquad (s=1,\dots,q),$$
$$W_{s-1,s}=-m\,\frac{\gamma_{s-1}(1-\gamma_{s-1})\,\gamma_s(1-\gamma_s)}{\gamma_s-\gamma_{s-1}} \qquad (s=2,\dots,q),$$

with $\gamma_0=0$ and $\gamma_{q+1}=1$. By use of the recursion formulae in Usmani (1994) for the inversion of a tri-diagonal matrix, a closed form for the $s$th diagonal component of $W^{-1}$ is available. Hence, using (8) and noting that $g_s=\gamma_s(1-\gamma_s)$ for the logistic link, substituting in (9) yields that reduction of bias is equivalent to adding $1/2$ to the counts for the first and the last categories and leaving the rest of the counts unchanged.

The above adjustment scheme reproduces the empirical logistic transforms $\log\{(R_s+1/2)/(m-R_s+1/2)\}$, which are always finite and have smaller asymptotic bias than the maximum likelihood estimator (see Cox and Snell, 1989, §2.1.6, combined with the fact that the marginal distribution of $R_s$ is binomial with index $m$ and probability $\gamma_s$ for any $s$).
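A short sketch of the empirical logistic transforms (illustrative code, not from the paper), which stay finite even when the end categories have zero counts:

```python
import numpy as np

def empirical_logistic_transform(counts):
    # counts: observed counts in the ordered categories of one multinomial vector
    y = np.asarray(counts, dtype=float)
    m = y.sum()
    R = np.cumsum(y)[:-1]                     # cumulative counts R_1, ..., R_q
    return np.log((R + 0.5) / (m - R + 0.5))  # adjusted cumulative logits
```

For example, the raw cumulative logits for counts (0, 4, 0) would be infinite in both components, whereas the transforms above give log(1/9) and log(9).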

### 6.2 A note of caution for constant adjustments in general settings

Since the works of Haldane (1955) and Anscombe (1956) concerning the additive modification of the binomial count by $1/2$ for reducing the bias and guaranteeing finiteness in the problem of log-odds estimation, the addition of small constants to counts when the data are sparse has become standard practice for avoiding estimates on the boundary of the parameter space of categorical response models (see, for example, Hitchcock, 1962; Gart and Zweifel, 1967; Gart et al., 1985; Clogg et al., 1991). Especially in cumulative link models where $g$ is log-concave, if all the counts are positive then the maximum likelihood estimates cannot be on the boundary of the parameter space (Haberman, 1980).

Despite their simplicity, constant adjustment schemes are not recommended for general use for two reasons:

1. Because the adjustments are constants, the resultant estimators are generally not invariant to different representations of the data (for example, aggregated and disaggregated view), a desirable invariance property that the maximum likelihood estimator has, and which allows the practitioner not to be concerned with whether the data at hand are fully aggregated or not.

###### Example 6.1:

For example, consider the two representations of the same data in Table 2. Interest is in estimating the difference between the cumulative logits for the two groups of samples.

Irrespective of the data representation, the maximum likelihood estimate of the difference is finite, with a finite estimated standard error. Now suppose that the same small constant is added to each of the counts in the rows of the tables. The adjustment ensures that the parameter estimates are finite for both representations. Nevertheless, a common constant added to both tables causes differences, in some cases large ones, in the resultant inferences. For the left table the maximum likelihood estimate based on the adjusted data and its estimated standard error differ from those for the right table. If Wald-type procedures were used for inference, with a normal approximation for the distribution of the approximate pivot formed by dividing the estimate by its asymptotic standard error based on the Fisher information, then the $p$-value of the test would differ depending on which table was used.

2. Furthermore, the moments of the maximum likelihood estimator generally depend on the parameter values (see, for example, Cordeiro and McCullagh, 1991, for explicit expressions of the first-order bias term in the special case of binomial regression models) and thus, as is also amply evident from the studies in Hitchcock (1962) and Gart et al. (1985), there cannot be a universal constant that yields estimates which are optimal according to some frequentist criterion.

Both of the above concerns with constant adjustment schemes are dealt with by the additive adjustment scheme of Subsection 5.1. Firstly, by construction, the iteration of Subsection 5.1 yields estimates with second-order bias. Secondly, because the adjustments depend on the parameters only through the linear predictors, which in turn do not depend on the way the data are represented, the adjustment scheme leads to estimators that are invariant to the data representation. For both representations of the data in Table 2, the bias-reduced estimate and its estimated standard error are identical.
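The non-invariance of constant adjustments under aggregation, point 1 above, can be illustrated with a small hypothetical example (the counts below are made up and are not the data of Table 2):

```python
import numpy as np

def pooled_adjusted_logodds(table, c=0.5):
    # add the constant c to every cell, then pool and take the log-odds
    t = np.asarray(table, dtype=float) + c
    return np.log(t[:, 0].sum() / t[:, 1].sum())

aggregated    = [[7, 0]]            # 7 successes, 0 failures in one row
disaggregated = [[4, 0], [3, 0]]    # the same totals, split over two rows

# aggregated view: log(7.5/0.5) = log 15; disaggregated view: log(8/1) = log 8
```

The two representations of identical data lead to different adjusted estimates, whereas an adjustment that depends on the data only through the linear predictors would give the same answer in both cases.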

## 7 Invariance properties of the reduced-bias estimator

### 7.1 Equivariance under linear transformations

The maximum likelihood estimator is exactly equivariant under one-to-one transformations of the parameter vector. That is, if $\hat{\delta}$ is the maximum likelihood estimator of $\delta$, then the maximum likelihood estimator of a one-to-one transformation of $\delta$ is simply that transformation evaluated at $\hat{\delta}$. In contrast to the maximum likelihood estimator, the reduced-bias estimator is not equivariant under every such transformation; bias is a parameterization-specific quantity, and hence any attempt to improve it can violate exact equivariance. Nevertheless, the reduced-bias estimator is equivariant under linear transformations of $\delta$ defined through full-rank matrices of constants.

To see this, assume that one fits the multinomial model under such a linear reparameterization, so that the reparameterized linear predictor is a linear combination of the original one. Using expression (9), the $t$th component of the adjusted score function under the reparameterization is

$$U'_t=\sum_{r=1}^{n}\sum_{s=1}^{q}g'_{rs}\left\{\frac{y_{rs}+c'_{rs}-c'_{r,s-1}}{\pi'_{rs}}-\frac{y_{r,s+1}+c'_{r,s+1}-c'_{rs}}{\pi'_{r,s+1}}\right\}z_{rst}, \tag{13}$$

for $t=1,\dots,p+q$, where the primed quantities are evaluated under the reparameterized model. Note that all quantities in (13) depend on the parameters only through the linear predictors. Thus, comparing (9) to (13), if an estimate solves the original adjusted score equations, then its linear transformation must solve the adjusted score equations of the reparameterized model.

The bias-corrected estimator defined in (10) can also be shown to be equivariant under linear transformations, using the equivariance of the maximum likelihood estimator and the fact that the first-order bias term depends on the parameters only through the linear predictors.

### 7.2 Invariance under reversal of the order of categories

One of the properties of proportional-odds models, and more generally of cumulative link models with a symmetric latent distribution, is their invariance under reversal of the order of the categories: reversing the categories, while simultaneously changing the sign of the regression parameters and changing the sign (and hence the order) of the cutpoints in model (2), results in the same category probabilities. Given the usual arbitrariness in the definition of ordinal scales in applications, this is a desirable invariance property for the analysis of ordinal data.

The maximum likelihood estimator respects this invariance property. That is, if the categories are reversed, then the new fit can be obtained by merely changing the sign of the estimated regression parameters and reversing the sign and order of the estimated cutpoints.

The reduced-bias estimator respects the same invariance property, too. To see this, assume that one fits the multinomial model with the order of the categories reversed. Because $G$ is symmetric about zero, the reversed model is a reparameterization of model (3) through a linear transformation of the parameters. Hence, with

and based on the results of Subsection 7.1, (and also ).

## 8 Properties of the reduced-bias estimator and associated inferential procedures: a complete enumeration study

### 8.1 Study design

The frequentist properties of the reduced-bias estimator are investigated through a complete enumeration study of contingency tables with fixed row totals. The rows of the tables correspond to a two-level covariate with values and , and the columns to the levels of an ordinal response with categories . The row totals are fixed to for and to for . The right table in Table 2 is a special case with , , , and row totals , . The present complete enumeration involves tables. We consider a multinomial model with

$$\gamma_{1s}=G(\alpha_s-\beta x_1), \qquad \gamma_{2s}=G(\alpha_s-\beta x_2) \qquad (s=1,\dots,q), \tag{14}$$

where the cutpoints are regarded as nuisance parameters but must nevertheless be estimated from the data, because they allow flexibility in the probability configurations within each of the rows of the table.

For the estimation of we consider the maximum likelihood estimator , the bias-corrected estimator , the reduced-bias estimator , and the generalized empirical logistic transform which is defined in McCullagh (1980, §2.3) and is an alternative estimator with smaller asymptotic bias than the maximum likelihood estimator specifically engineered for the estimation of in tables with fixed row totals. The estimators , and are the -components of the vectors of estimators , and , respectively, where is the vector of all parameters. The estimators are compared in terms of bias, mean-squared error and coverage probability of the respective Wald-type asymptotic confidence intervals. The following theorem is specific to and cumulative link models, and can be used to reduce the parameter settings that need to be considered in the current study for evaluating the performance of the estimators.

###### Theorem 8.1:

Consider a contingency table with fixed row totals and , and the multinomial model that satisfies (14). Furthermore, consider an estimator of , which is equivariant under linear transformations. Then if , the bias function and the mean squared error of satisfy

$$E(\beta^*(T)-\beta;\,\beta,\alpha)=-E(\beta^*(T)+\beta;\,-\beta,\alpha), \quad\text{and}$$
$$E\{(\beta^*(T)-\beta)^2;\,\beta,\alpha\}=E\{(\beta^*(T)+\beta)^2;\,-\beta,\alpha\},$$
respectively.
###### Proof.

Define an operator $R$ which, when applied to a table $T$, results in the contingency table obtained by reversing the order of the rows of $T$. Hence, $P(T;\beta,\alpha)=P(R(T);-\beta,\alpha)$.

Because the estimator is equivariant under linear transformations, it suffices to study its behaviour for a fixed choice of covariate values. Then, any other combination of covariate values results from an affine transformation, and equivariance gives that a corresponding translation and re-scaling of the parameters results in exactly the same fit. Hence, the shape properties of the estimator remain invariant to the choice of covariate values.

Denote by $\mathcal{T}$ the set of all possible tables with the given fixed row totals. By the definition of the model, every table in $\mathcal{T}$ has positive probability. There is a subset $E$ of tables on which the estimator vanishes, and the complement of $E$ can be partitioned into the sets $F_1$ and $F_2$, which have the same cardinality and where $T\in F_1$ if and only if $R(T)\in F_2$. For tables outside $E$, equivariance under the row-reversing linear transformation gives that $\beta^*(R(T))=-\beta^*(T)$. Hence,

$$\begin{aligned}E(\beta^*(T);\beta,\alpha)&=\sum_{T\notin E}\beta^*(T)P(T;\beta,\alpha)\\&=\sum_{T\notin E,\,T\in F_1}\beta^*(T)\{P(T;\beta,\alpha)-P(R(T);\beta,\alpha)\}\\&=\sum_{T\notin E,\,T\in F_1}\beta^*(T)\{P(R(T);-\beta,\alpha)-P(T;-\beta,\alpha)\}=-E(\beta^*(T);-\beta,\alpha).\end{aligned} \tag{15}$$

Subtracting $\beta$ from both sides of the above equality gives the identity on the bias. For the identity on the mean squared error one merely needs to repeat a calculation corresponding to (15), starting from $E\{(\beta^*(T))^2;\beta,\alpha\}$. ∎

A similar line of proof can be used to show that, if $m_1 = m_2$, the coverage probability of Wald-type asymptotic confidence intervals for $\beta$ is symmetric about $\beta = 0$, provided that the estimator of the standard error of $\beta^*(T)$ takes the same value at $T$ and $R(T)$.
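The key identity used in the proof, $P(T; \beta, \alpha) = P(R(T); -\beta, \alpha)$ when $x_1 = -1/2$ and $x_2 = 1/2$, can be checked numerically. The sketch below is illustrative only (it is not the paper's supplementary code): it assumes a cumulative logit link and a $2 \times 4$ table, and the function names and parameter values are hypothetical.

```python
import math


def cell_probs(alpha, eta):
    # category probabilities from a cumulative logit model:
    # P(Y <= s) = expit(alpha_s + eta), s = 1, ..., q - 1
    expit = lambda z: 1.0 / (1.0 + math.exp(-z))
    cum = [expit(a + eta) for a in alpha] + [1.0]
    return [cum[0]] + [cum[s] - cum[s - 1] for s in range(1, len(cum))]


def multinom_logpmf(y, p):
    # log probability mass of a multinomial count vector y with cell probabilities p
    out = math.lgamma(sum(y) + 1)
    for ys, ps in zip(y, p):
        out += ys * math.log(ps) - math.lgamma(ys + 1)
    return out


def table_prob(T, beta, alpha, x=(-0.5, 0.5)):
    # P(T; beta, alpha): product of the two independent row multinomials,
    # with linear predictor alpha_s - beta * x_r for row r
    lp = 0.0
    for row, xr in zip(T, x):
        lp += multinom_logpmf(row, cell_probs(alpha, -beta * xr))
    return math.exp(lp)


alpha, beta = [-1.0, 0.0, 1.0], 0.7
T = [[2, 1, 0, 3], [0, 2, 2, 1]]
R_T = [T[1], T[0]]  # row-reversed table
# reversing the rows is equivalent to flipping the sign of beta
assert abs(table_prob(T, beta, alpha) - table_prob(R_T, -beta, alpha)) < 1e-12
```

Since the two rows are independent multinomials and $x_2 = -x_1$, swapping the rows of the table and flipping the sign of $\beta$ produce exactly the same pair of row distributions, which is all the proof needs.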

### 8.2 Special case: Proportional odds model

For demonstration purposes, the values of the competing estimators are obtained for a proportional odds model ($q = 4$ categories, logistic $F$) with $x_1 = -1/2$ and $x_2 = 1/2$, for each of the possible tables with equal row totals $m_1 = m_2 = m$, for a range of values of $m$. All estimators considered are equivariant under linear transformations and hence, according to the proof of Theorem 8.1, the outcome of the complete enumeration for the comparative performance of the estimators generalizes to any choice of $x_1$ and $x_2$.
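As a concrete sketch of what "all possible tables with fixed row totals" means, the following illustrative snippet (not the study's actual code; the totals and $q$ are arbitrary) enumerates every $2 \times q$ table with given row totals. When $m_1 = m_2 = m$, there are $\binom{m+q-1}{q-1}^2$ such tables.

```python
def compositions(m, q):
    """All ordered q-tuples of non-negative integers summing to m."""
    if q == 1:
        yield (m,)
        return
    for first in range(m + 1):
        for rest in compositions(m - first, q - 1):
            yield (first,) + rest


def all_tables(m1, m2, q=4):
    # every 2 x q contingency table with fixed row totals m1 and m2
    for r1 in compositions(m1, q):
        for r2 in compositions(m2, q):
            yield (r1, r2)


# with q = 4 and m1 = m2 = 3 there are C(6, 3)^2 = 20^2 = 400 tables
tables = list(all_tables(3, 3))
assert len(tables) == 400
```

The count grows quickly with $m$, which is why complete enumeration is only practical for small row totals.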

The estimators $\hat\beta$ and $\tilde\beta$ are not available in closed form, and one needs to rely on iterative procedures for finding the roots of the ordinary and adjusted score equations, respectively, for every table. Fisher scoring is used to obtain $\hat\theta$, and the iterative maximum likelihood approach of Subsection 5.1 is used for $\tilde\theta$. The maximum likelihood estimate is judged satisfactory if the current value $\theta^{(k)}$ of the iterative algorithm makes the absolute value of every component of the score vector smaller than a small positive tolerance. For $\tilde\theta$, the latter criterion is used with the adjusted scores in the place of the ordinary scores.

For evaluating the performance of the estimators, the probability of each of the tables has been calculated under model (14), for parameter values that are fixed according to the following scheme. The parameter $\beta$ takes values on a sufficiently fine equi-spaced grid of positive values. For negative $\beta$ the results can be predicted from the symmetry relations of Theorem 8.1. For each value of $\beta$, the nuisance parameters are set to the equi-spaced, symmetric values $\alpha = (-\delta, 0, \delta)$ for a range of values of $\delta > 0$. Figure 1 is a pictorial representation of the probability settings for the two multinomial vectors in the $2 \times 4$ contingency table with fixed row totals, at each combination of values for $\beta$ and $\delta$. Under the above scheme for fixing parameter values, the probability of the end categories tends to zero as $\delta$ increases, and hence more extreme probability settings are being considered as $\delta$ grows.
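To see why the end-category probabilities vanish as the spacing of the cutpoints grows, one can evaluate the first- and last-category probabilities directly. This is an illustrative check with a logit link; the cutpoint pattern $\alpha = (-\delta, 0, \delta)$ and the values of $\beta$ and $x$ are assumptions for the sketch.

```python
import math


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


beta, x1, x2 = 1.0, -0.5, 0.5
first, last = [], []
for delta in (0.5, 2.0, 4.0):
    alpha = (-delta, 0.0, delta)
    # P(Y = 1 | row 1) = F(alpha_1 - beta * x_1): first-category probability
    first.append(expit(alpha[0] - beta * x1))
    # P(Y = 4 | row 2) = 1 - F(alpha_3 - beta * x_2): last-category probability
    last.append(1.0 - expit(alpha[2] - beta * x2))
# both sequences shrink towards zero as delta grows, so larger delta means
# more extreme probability settings and more tables with empty end categories
```

This is exactly the mechanism that makes boundary estimates more likely for large $\delta$ in the enumeration study.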

The findings of the current complete enumeration exercise are outlined in the following subsection. The same complete enumeration design has been applied to a number of other settings, with unequal row totals, with different link functions, with different numbers of categories, and/or with different non-symmetric specifications for the nuisance parameters (results not shown here), yielding qualitatively the same conclusions; the current setup merely allows a clear pictorial representation of the findings on the behaviour of the reduced-bias estimator. An R script that can produce the results of the current complete enumeration for any number of categories, any link function, any configuration of totals and any combination of parameter settings in $2 \times q$ contingency tables is available in the supplementary material.

### 8.3 Remarks on the results

Remark 1. On the estimates of $\alpha_1$, $\alpha_2$ and $\alpha_3$: According to Section 3, for data sets where a specific category $s$ is observed in neither of the two rows, the maximum likelihood estimate of $\alpha$ is on the boundary of the parameter space as follows:

  $s = 1: \hat\alpha_1 = -\infty; \quad s = 2: \hat\alpha_2 = \hat\alpha_1; \quad s = 3: \hat\alpha_3 = \hat\alpha_2; \quad s = 4: \hat\alpha_3 = +\infty.$

At least for log-concave $F$, according to the results in Pratt (1981), the above equations extend directly to the case of any number of categories and any number of covariate settings, and can be used directly to check what happens when two or more categories are unobserved.

Nevertheless, the maximum likelihood estimator of $\beta$ is invariant to merging a non-observed category with either the previous or the next category, and can be finite even if some of the $\alpha$ parameters are on the boundary of the parameter space. Hence, maximum likelihood inferences on $\beta$ are possible even if a category is not observed. The same behaviour is observed for the reduced-bias estimators, with the difference that if the non-observed category is the first and/or the last, then $\tilde\alpha_1$ and/or $\tilde\alpha_3$ are finite. A special case of this observation has been encountered in Subsection 6.1, where reduction of the bias corresponds to adding $1/2$ to the counts of the end categories, guaranteeing the finiteness of the cumulative logits. Hence, there is no need for non-observed end categories to be merged with the neighbouring ones when the reduced-bias estimator is used. If any of the other categories is empty, then the reduced-bias estimator of $\beta$ is invariant to merging those with any of the neighbouring ones.

It should be mentioned here that if both the second and the third category are empty, then the reduced-bias estimate of $\beta$ and the generalized empirical logistic transform are identical. To see this, note that in the special case of logistic regression the adjusted scores in Subsection 4.4 suggest adding half a leverage to each of the success and failure counts (this result for logistic regressions was obtained in Firth, 1993). Furthermore, the resulting two-category model is saturated, and hence both leverages are $1$. Hence the reduced-bias estimate of $\beta$ coincides with the generalized empirical logistic transform, which in this case is the difference of the empirical logistic transforms of the two rows, with $1/2$ added to each count.
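For the collapsed $2 \times 2$ case this coincidence can be made concrete: with $1/2$ added to each cell count, the log odds ratio stays finite even at boundary configurations where the maximum likelihood estimate is infinite. A minimal illustration, with hypothetical counts and a logit link assumed:

```python
import math


def empirical_logit(y, m):
    # empirical logistic transform with the 1/2 correction: finite even
    # when y = 0 or y = m, unlike the raw logit log(y / (m - y))
    return math.log((y + 0.5) / (m - y + 0.5))


# collapsed 2 x 2 table: y_r successes out of m_r in row r;
# y1 = 0 is a boundary configuration, so the ML log odds ratio is infinite
y1, m1 = 0, 5
y2, m2 = 4, 5
beta_elt = empirical_logit(y2, m2) - empirical_logit(y1, m1)
# beta_elt = log(4.5 / 1.5) - log(0.5 / 5.5) = log(33), a finite value
```

The raw log odds ratio for these counts involves $\log(0)$ and is undefined, while the corrected transform always returns a finite estimate.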

Remark 2. On $\hat\beta$ and $\hat\beta_{\rm BC}$: As is expected from the discussion in Section 3, the maximum likelihood estimator of $\beta$ is infinite for certain configurations of zeros in the table, and for such configurations the bias-corrected estimator is also undefined, owing to its explicit dependence on the maximum likelihood estimator. Hence, for $\hat\beta$ and $\hat\beta_{\rm BC}$, the bias function is undefined and the mean squared error is infinite. A possible comparison of the performance of $\hat\beta$ and $\hat\beta_{\rm BC}$ is in terms of conditional bias and conditional mean squared error, where the conditioning event is that $\hat\beta$ has a finite value.

For detecting parameters with infinite estimates, the diagnostics in Lesaffre and Albert (1989, §4) for multinomial logistic regressions are adapted to the current setting. Data sets that result in infinite estimates for $\beta$ have been detected by observation of the size of the corresponding estimated standard error based on the inverse of the Fisher information, and by observation of the absolute value of the estimates once the convergence criteria were satisfied. If both the standard error and the absolute value of the estimate exceeded appropriately large thresholds, then the estimate was labelled infinite. A second pass through the data sets has been performed, making the convergence criterion for the Fisher scoring stricter. The estimates that were labelled infinite using the aforementioned diagnostics diverged further towards infinity, while the rest of the estimates remained unchanged to high accuracy.
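The behaviour that these diagnostics exploit can be reproduced in miniature: under complete separation, tightening the stopping rule only pushes the estimate and its estimated standard error further out. The sketch below fits a one-parameter logistic regression by Fisher scoring; it is illustrative code, not the diagnostics' actual implementation, and the data and iteration counts are arbitrary.

```python
import math


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def fisher_scoring(y, m, x, n_iter):
    # one-parameter logistic regression, logit(p_r) = beta * x_r,
    # run for a fixed number of Fisher scoring iterations
    beta = 0.0
    for _ in range(n_iter):
        p = [expit(beta * xr) for xr in x]
        score = sum(xr * (yr - mr * pr) for xr, yr, mr, pr in zip(x, y, m, p))
        info = sum(xr ** 2 * mr * pr * (1.0 - pr) for xr, mr, pr in zip(x, m, p))
        beta += score / info
    # Wald standard error from the inverse Fisher information at the final beta
    p = [expit(beta * xr) for xr in x]
    info = sum(xr ** 2 * mr * pr * (1.0 - pr) for xr, mr, pr in zip(x, m, p))
    return beta, 1.0 / math.sqrt(info)


# completely separated data: no successes in row 1, only successes in row 2
y, m, x = (0, 5), (5, 5), (-0.5, 0.5)
b_loose, se_loose = fisher_scoring(y, m, x, n_iter=10)
b_strict, se_strict = fisher_scoring(y, m, x, n_iter=25)
# more iterations (a stricter stopping rule) give a larger estimate and a much
# larger standard error, the signature of an infinite maximum likelihood estimate
```

An estimate that keeps growing as the convergence criterion tightens, accompanied by an exploding standard error, is labelled infinite; a genuinely finite estimate stabilizes instead.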

The probability of encountering an infinite $\hat\beta$ for the different possible parameter settings is shown in the top row of Figure 2. For $\beta < 0$, the probability of encountering an infinite value is simply a reflection of the probability for $\beta > 0$. As is apparent, the probability of infinite estimates increases as $\delta$ increases and, for each value of $\delta$, it increases as $\beta$ increases. Naturally, as $m$ increases the probability of encountering infinite estimates is reduced, but it is always positive.

Of course, the findings from the current comparison of $\hat\beta$ with $\hat\beta_{\rm BC}$ should be interpreted critically, bearing in mind the conditioning on the finiteness of $\hat\beta$; the comparison suffers from the fact that the first-order bias term that is required for the calculation of $\hat\beta_{\rm BC}$ is calculated unconditionally. The comparison is fairer when the probability of infinite estimates is small; this happens in a region around $\beta = 0$ whose size increases as $m$ increases.

The conditional bias and conditional mean squared error of $\hat\beta$ and $\hat\beta_{\rm BC}$ are shown in the left and right plots of the second row of Figure 2. The identities in Theorem 8.1 apply also to the conditional bias and conditional mean squared error; to see this, set $P(T; \beta, \alpha)$ to be the conditional probability of each table in the proof of Theorem 8.1. Hence, for $\beta < 0$, the conditional bias is simply a reflection through the origin of the conditional bias for $\beta > 0$, and the conditional mean squared error is a reflection of the conditional mean squared error for $\beta > 0$ across the line $\beta = 0$.

The behaviour of the estimators in terms of conditional bias is similar, with the maximum likelihood estimator performing slightly better than $\hat\beta_{\rm BC}$ for small $m$. As $m$ increases, the bias-corrected estimator starts performing better in terms of bias in a region around $\beta = 0$, where the probability of infinite estimates is smallest. The same is noted for the conditional mean squared error: the estimator $\hat\beta_{\rm BC}$ performs better than $\hat\beta$ in a region around zero whose size increases as $m$ increases. The same behaviour persists for larger values of $m$ (figures not shown here).

Remark 3. On $\tilde\beta$ and $\beta^{\rm GE}$: The estimators $\tilde\beta$ and $\beta^{\rm GE}$ always have finite values, irrespective of the configuration of zeros in the table. Hence, in contrast to $\hat\beta$ and $\hat\beta_{\rm BC}$, a comparison in terms of their unconditional bias and unconditional mean squared error is possible. The left plot of Figure 3 shows the bias function of the estimator $\tilde\beta$ for the parameter settings considered in the complete enumeration study. For $\beta < 0$, the bias fu