# Title

# Most Likely Transformations

###### Abstract

We propose and study properties of maximum likelihood estimators in the class of conditional transformation models. Based on a suitable explicit parameterisation of the unconditional or conditional transformation function, we establish a cascade of increasingly complex transformation models that can be estimated, compared and analysed in the maximum likelihood framework. Models for the unconditional or conditional distribution function of any univariate response variable can be set-up and estimated in the same theoretical and computational framework simply by choosing an appropriate transformation function and parameterisation thereof. The ability to evaluate the distribution function directly allows us to estimate models based on the exact likelihood, especially in the presence of random censoring or truncation. For discrete and continuous responses, we establish the asymptotic normality of the proposed estimators. A reference software implementation of maximum likelihood-based estimation for conditional transformation models that allows the same flexibility as the theory developed here was employed to illustrate the wide range of possible applications.

Most Likely Transformations

Torsten Hothorn Universität Zürich Lisa Möst Universität München Peter Bühlmann ETH Zürich

Keywords: transformation model, distribution regression, conditional distribution function, conditional quantile function, censoring, truncation.

^{†}

^{†}A shortened version of this technical report was accepted for publication in the Scandinavian Journal of Statistics 2017-06-19.

## 1 Introduction

In a broad sense, we can understand all statistical models as models of distributions or certain characteristics thereof, especially the mean. All distributions for at least ordered responses can be characterised by their distribution, quantile, density, odds, hazard or cumulative hazard functions. In a fully parametric setting, all these functions have been specified up to unknown parameters, and the ease of interpretation can guide us in looking at the appropriate function. In the semi- and non-parametric contexts, however, the question arises how we can obtain an estimate of one of these functions without assuming much about their shape. For the direct estimation of distribution functions, we deal with monotonic functions in the unit interval, whereas for densities, we need to make sure that the estimator integrates to one. The hazard function comes with a positivity constraint, and monotonicity is required for the positive cumulative hazard function. These computationally inconvenient restrictions disappear completely only when the log-hazard function is estimated, and this explains the plethora of research papers following this path. However, the lack of any structure in the log-hazard function comes at a price. A too-erratic behaviour of estimates of the log-hazard function has to be prevented by some smoothness constraint; this makes classical likelihood inference impossible. The novel characterisation and subsequent estimation of distributions via their transformation function in a broad class of transformation models that are developed in this paper can be interpreted as a compromise between structure (monotonicity) and ease of parameterisation, estimation and inference. This transformation approach to modelling and estimation allows standard likelihood inference in a large class of models that have so far commonly been dealt with by other inference procedures.

Since the introduction of transformation models based on non-linear transformations of some response variable by Box and Cox (1964), this attractive class of models has received much interest. In regression problems, transformation models can be understood as models for the conditional distribution function and are sometimes referred to as “distribution regression”, in contrast to their “quantile regression” counterpart (Chernozhukov et al. 2013). Traditionally, the models were actively studied and applied in the analysis of ordered categorical or censored responses. Recently, transformation models for the direct estimation of conditional distribution functions for arbitrary responses received interest in the context of counterfactual distributions (Chernozhukov et al. 2013), probabilistic forecasting (Gneiting and Katzfuss 2014), distribution and quantile regression (Leorato and Peracchi 2015; Rothe and Wied 2013), probabilistic index models (Thas et al. 2012) and conditional transformation models (Hothorn et al. 2014). The core idea of any transformation model is the application of a strictly monotonic transformation function for the reformulation of an unknown distribution function as , where the unknown transformation function is estimated from the data. Transformation models have received attention especially in situations where the likelihood contains terms involving the conditional distribution function with inverse link function , most importantly for censored, truncated and ordered categorical responses. For partially linear transformation models with transformation function , much emphasis has been given to estimation procedures treating the baseline transformation (e.g. the log-cumulative baseline hazard function in the Cox model) as a high-dimensional nuisance parameter. Prominent members of these estimation procedures are the partial likelihood estimator (Cox 1975) and approaches influenced by the estimation equations introduced by Cheng et al. (1995) and Chen et al. (2002). Once an estimate for the shift is obtained, the baseline transformation is then typically estimated by the non-parametric maximum likelihood estimator (see, for example, Cheng et al. 1997). An overview of the extensive literature on the simultaneous non-parametric maximum likelihood estimation of and , i.e. estimation procedures not requiring an explicit parameterisation of , for censored continuous responses is given in Zeng and Lin (2007).

An explicit parameterisation of is common in models of ordinal responses (Tutz 2012). For survival times, Kooperberg et al. (1995) and Kooperberg and Clarkson (1997) introduced a cubic spline parameterisation of the log-conditional hazard function with the possibility of response-varying effects and estimated the corresponding models by maximum likelihood. Crowther and Lambert (2014) followed up on this suggestion and used restricted cubic splines. Many authors studied penalised likelihood approaches for spline approximations of the baseline hazard function in a Cox model, e.g. Joly et al. (1998); Cai and Betensky (2003); Sabanés Bové and Held (2013) or Ma et al. (2014). Less frequently, the transformation function was modelled directly. Mallick and Walker (2003); Chang et al. (2005) and McLain and Ghosh (2013) used Bernstein polynomials for , and Royston and Parmar (2002) proposed a maximum likelihood approach using cubic splines for modelling and also time-varying effects. The connection between these different transformation models is difficult to see because most authors present their models in the relatively narrow contexts of survival or ordinal data. The lack of a general understanding of transformation models made the development of novel approaches in this model class burdensome. Hothorn et al. (2014) decoupled the parameterisation of the conditional transformation function from the estimation procedure, and showed that many interesting and novel models can be understood as transformation models. The boosting-based optimisation of proper scoring rules, however, was only developed for uncensored und right-censored (Möst and Hothorn 2015) observations in the absence of truncation and requires the numerical approximation of the true target function. In a similar spirit, Chernozhukov et al. (2013) applied the connection for estimation in the response-varying effects transformation model ; this approach can be traced back to Foresi and Peracchi (1995).

A drawback of all but the simplest transformation models is the lack of a likelihood estimation procedure. Furthermore, although important connections to other models have been known for some time (Doksum and Gasko 1990), it is often not easy to see how broad and powerful the class of transformation models actually is. We address these issues and embed the estimation of unconditional and conditional distribution functions of arbitrary univariate random variables under all forms of random censoring and truncation into a common theoretical and computational likelihood-based framework. In a nutshell, we show in Section 2 that all distributions can be generated by a strictly monotonic transformation of some absolute continuous random variable. The likelihood function of the transformed variable can then be characterised by this transformation function. The parameters of appropriate parameterisations of the transformation function, and thus the parameters of the conditional distribution function in which we are interested, can then be estimated by maximum likelihood under simple linear constraints that allow classical asymptotic likelihood inference, as will be shown in Section 3. Many classical and contemporary models are introduced as special cases of this framework. In particular, all transformation models sketched in this introduction can be understood and estimated in this novel likelihood-based framework. Extensions of classical and contemporary transformation models as well as some novel models are derived from our unified theoretical framework of transformation functions in Section 4, and their empirical performance is illustrated and evaluated in Section 5.

## 2 The Likelihood of Transformations

Let denote a probability space and a measureable space with at least ordered sample space . We are interested in inference about the distribution of a random variable , i.e. the probability space defined by the measureable function . For the sake of notational simplicity, we present our results for the unconditional and ordered case first; regression models and unordered responses are discussed in Section 4.2. The distribution is dominated by some measure and characterised by its density function , distribution function , quantile function , odds function , hazard function or cumulative hazard function . For notational convenience, we assume strict monotonicity of , i.e. . Our aim is to obtain an estimate of the distribution function from a random sample . In the following, we will show that one can always write this potentially complex distribution function as the composition of a much simpler and a priori specified distribution function and a strictly monotonic transformation function . The task of estimating is then reduced to obtaining an estimate . The latter exercise, as we will show in this paper, is technically and conceptually attractive.

Let denote the Euclidian space with Borel -algebra and an measureable function such that the distribution is absolutely continuous ( denotes the Lebesgue measure) in the probability space . Let and denote the corresponding distribution and quantile functions. We furthermore assume , and for a log-concave density as well as the existence of the first two derivatives of the density with respect to ; both derivatives shall be bounded. We do not allow any unknown parameters for this distribution. Possible choices include the standard normal, standard logistic (SL) and minimum extreme value (MEV) distribution with distribution functions , and , respectively. In the first step, we will show that there always exists a unique and strictly monotonic transformation such that the unknown and potentially complex distribution that we are interested in can be generated from the simple and known distribution via . More formally, let denote a measureable function. The composition is a random variable on . We can now formulate the existence and uniqueness of as a corollary to the probability integral transform.

###### Corollary 1.

For all random variables and , there exists a unique strictly monotonically increasing transformation , such that .

###### Proof.

Let and . Then and by the probability integral transform. Let , such that . From

we get the uniqueness of and therefore . The quantile function and the distribution function exist by assumption and are both strictly monotonic and right-continuous. Therefore, is strictly monotonic and right-continuous and so is . ∎

###### Corollary 2.

For , we have and .

This result for absolutely continuous random variables can be found in many textbooks (e.g. Lindsey 1996), Corollary 1 also covers the discrete case.

###### Corollary 3.

For the counting measure , is a right-continuous step function because is a right-continuous step function with steps at .

We now characterise the distribution by the corresponding transformation function , set up the corresponding likelihood of such a transformation function and estimate the transformation function based on this likelihood. Let denote the space of all strictly monotonic transformation functions. With the transformation function , we can evaluate as . Therefore, we only need to study the transformation function ; the inverse transformation (used to define a “group family” or “group model” by Lehmann 1983; Bickel et al. 1993) is not necessary in what follows. The density for absolutely continuous variables () is now given by

For discrete responses () with finite sample space , the density is

and for countably infinite sample spaces , we get the density

With the conventions and , we use the more compact notation in the sequel.

For a given transformation function , the likelihood contribution of a datum is defined in terms of the distribution function (Lindsey 1996):

This “exact” definition of the likelihood applies to most practical situations of interest and, in particular, allows discrete and (conceptually) continuous as well as censored or truncated observations . For a discrete response , we have and , such that . For absolutely continuous random variables , we almost always observe an imprecise datum and, for short intervals , approximate the exact likelihood by the term or simply with (Lindsey 1999). This approximation only works for relatively precise measurements, i.e. short intervals. If longer intervals are observed, one speaks of “censoring” and relies on the exact definition of the likelihood contribution instead of using the above approximation (Klein and Moeschberger 2003). In summary, the likelihood contribution of a conceptually “exact continuous” or left-, right- or interval-censored continuous or discrete observation is given by

under the assumption of random censoring. The likelihood is more complex under dependent censoring (Klein and Moeschberger 2003), but we will not elaborate on this issue. The likelihood contribution of an ordered factor in category is equivalent to the term contributed by an interval-censored observation , when category is defined by the interval . Thus, the expression for the likelihood contribution reflects the equivalence of interval-censoring and categorisation at corresponding cut-off points.

For truncated observations in the interval , the above likelihood contribution is defined in terms of the distribution function conditional on the truncation

and thus the likelihood contribution changes to (Klein and Moeschberger 2003)

It is important to note that the likelihood is always defined in terms of a distribution function (Lindsey 1999), and it therefore makes sense to directly model the distribution function of interest. The ability to uniquely characterise this distribution function by the transformation function gives rise to the following definition of an estimator .

###### Definition 1 (Most likely transformation).

Let denote an independent sample of possibly randomly censored or truncated observations from . The estimator

is called the most likely transformation (MLT).

Log-concavity of ensures concavity of the log-likelihood (except when all observations are right-censored) and thus ensures the existence and uniqueness of .

##### Example

For an absolutely continuous response the likelihood and log-likelihood for are approximated by the density and log-density evaluated at , respectively:

Strict monotonicity of the transformation function is required; otherwise the likelihood is not defined. The term is not a penalty term, but the likelihood favours transformation functions with a large positive derivative at the observations. If we assume and for the choice with and , we can restrict to linear functions . The likelihood reduces to

In this simple location-scale family, the most likely transformation is characterised by the parameters of the normal distribution of . It is important to note that for other choices of , the most likely transformation is non-linear; however, the distribution function is invariant with respect to because we can always write as . In other words, with , we can still model normal responses ; however, a non-linear transformation function is required.

Many distributions are defined by a transformation function , for example, the Box-Cox power exponential family (Rigby and Stasinopoulos 2004), the sinh-arcsinh distributions (Jones and Pewsey 2009), or the T-X family of distributions (Alzaatreh et al. 2013). The parameters of these distributions can, for example, be estimated by the GAMLSS approach (Rigby and Stasinopoulos 2005). In what follows, we do not assume any specific form of the transformation function but parameterise in terms of basis functions. We now introduce such a parameterisation, a corresponding family of distributions, a maximum likelihood estimator and a large class of models for unconditional and conditional distributions.

## 3 Transformation Analysis

We parameterise the transformation function as a linear function of its basis-transformed argument using a basis function , such that . The choice of the basis function is problem specific and will be discussed in Section 4. The likelihood only requires evaluation of , and only an approximation thereof using the Lebesgue density of “exact continuous” observations makes the evaluation of the first derivative of with respect to necessary. In this case, the derivative with respect to is given by , and we assume that is available. In the following, we will write and for the transformation function and its first derivative, omitting the argument , and we assume that both functions are bounded away from and . For a specific choice of and , the transformation family of distributions consists of all distributions whose distribution function is given as the composition ; this family can be formally defined as follows.

###### Definition 2 (Transformation family).

The distribution family

with parameter space is called transformation family of distributions with transformation functions , -densities , and error distribution function .

The classical definition of a transformation family relies on the idea of invariant distributions, i.e. only the parameters of a distribution are changed by a transformation function but the distribution itself is not changed. The normal family characterised by affine transformations is the most well-known example (e.g. Fraser 1968; Lindsey 1996). Here, we explicitly allow and encourage transformation functions that change the shape of the distribution. The transformation function is, at least in principle, flexible enough to generate any distribution function from the distribution function . We borrow the term “error distribution” function for from Fraser (1968), because can be understood as an error term in some of the models discussed in Section 4. The problem of estimating the unknown transformation function , and thus the unknown distribution function , reduces to the problem of estimating the parameter vector through maximisation of the likelihood function. We assume that the basis function is such that the parameters are identifiable.

###### Definition 3 (Maximum likelihood estimator).

Based on the maximum likelihood estimator , we define plug-in estimators of the most likely transformation function and the corresponding estimator of our target distribution as and . Because the problem of estimating an unknown distribution function is now embedded in the maximum likelihood framework, the asymptotic analysis benefits from standard results on the asymptotic behaviour of maximum likelihood estimators. We begin with deriving the score function and Fisher information. The score contribution of an “exact continuous” observation from an absolutely continuous distribution is approximated by the gradient of the log-density

(1) | |||||

For an interval-censored or discrete observation and (the constant terms and vanish), the score contribution is

(2) | |||||

For a truncated observation, the score function is .

The contribution of an “exact continuous” observation from an absolutely continuous distribution to the Fisher information is approximately

(NB: the weight to is constant one for ). For a censored or discrete observation, we have the following contribution to the Fisher information

For a truncated observation, the Fisher information is given by .

We will first discuss the asymptotic properties of the maximum likelihood estimator in the parametric setting with fixed parameters in both the discrete and continuous case. For continuous variables and a transformation function parameterised using a Bernstein polynomial, results for sieve maximum likelihood estimation, where the number of parameters increases with , are then discussed in Subsection 3.2.

### 3.1 Parametric Inference

Conditions on the densities of the error distribution and the basis functions ensuring consistency and asymptotic normality of the sequence of maximum likelihood estimators and an estimator of their asymptotic covariance matrix are given in the following three theorems. Due to the full parameterisation of the model, the proofs are simple standard results for likelihood asymptotics, and a more complex analysis (as required for estimation equations in the presence of a nuisance parameter , for example in Cheng et al. 1995; Chen et al. 2002) is not necessary. We will restrict ourselves to absolutely continuous or discrete random variables , where the likelihood is given in terms of the density . Furthermore, we will only study the case of a correctly specified transformation and refer the reader to Hothorn et al. (2014), where consistency results for arbitrary are given.

###### Theorem 1.

For and under the assumptions (A1) the parameter space is compact and (A2) where is well-separated:

the sequence of estimators converges to in probability, , as .

###### Proof.

The log-likelihood is continuous in , and due to (A2), each log-likelihood contribution is dominated by an integrable function. Thus, the result follows from van der Vaart (1998) (Theorem 5.8 with Example 19.7; see note at bottom of page 46). ∎

###### Remark 1.

Assumption (A1) is made for convenience, and relaxations of such a condition are given in van de Geer (2000) or van der Vaart (1998). The assumptions in (A2) are rather weak: the first one holds if the functions are not arbitrarily ill-posed, and the second one holds if the function is strictly convex in (if the assumption would not hold, we would still have convergence to the set ).

###### Theorem 2.

Under the assumptions of Theorem 1 and in addition (A3)

(A4) and (for the absolutely continuous case only) are nonsingular, and (A5) , and , the sequence is asymptotically normal with mean zero and covariance matrix

as .

###### Proof.

Because the map is continuously differentiable in for all in both the discrete and absolutely continuous case and the matrix

is continuous in as given in (1) and (2), the transformation family is differentiable in quadratic mean with Lemma 7.6 in van der Vaart (1998). Furthermore, assumptions (A4-5) ensure that the expected Fisher information matrix is nonsingular at . With the consistency and (A3), the result follows from Theorem 5.39 in van der Vaart (1998). ∎

###### Remark 2.

###### Theorem 3.

Under the assumptions of Theorem 2 and assuming , a consistent estimator for is given by

###### Proof.

Based on Theorems 1-3, we can perform standard likelihood inference on the model parameters . In particular, we can construct confidence intervals and confidence bands for the conditional distribution function from confidence intervals and bands for the linear functions . We complete this part by formally defining the class of transformation models.

###### Definition 4 (Transformation model).

The triple is called transformation model.

The transformation model fully defines the distribution of via and thus the corresponding likelihood . Our definition of transformation models as is strongly tied to the idea of structural inference (Fraser 1968) and group families (Lehmann 1983) or group models (Bickel et al. 1993). Fraser (1968) described a measurement model for by an error distribution and a structural equation , where is a linear function, thereby extending the location-scale family introduced by Fisher (1934) and refined by Pitman (1939). Group models consist of distributions generated by possibly non-linear . The main difference to these classical approaches is that we parameterise instead of . By extending the linear transformation functions dealt with by Fraser (1968) to non-linear transformations, we approximate the potentially non-linear transformation functions by , with subsequent estimation of the parameters . For given parameters , a sample from can be drawn by the probability integral transform, i.e. is drawn and then . This generalises the method published by Bender et al. (2005) from the Cox model to all conditional transformation models.

### 3.2 Non-parametric Inference

For continuous responses , any unknown transformation can be approximated by Bernstein polynomials of increasing order (Farouki 2012). For uncensored and right-censored responses and under the same conditions for as stated in Subsection 3.1, McLain and Ghosh (2013) showed that the non-parametric sieve maximum likelihood estimator is consistent with rate of convergence for with continuous bounded second derivatives in unconditional and linear transformation models (see Subsection 4.3). In the latter class, the linear shift parameters are asympotically normal and semi-parametrically efficient. Numerical approximations to the observed Fisher information were shown to lead to appropriate standard errors of by McLain and Ghosh (2013). Hothorn et al. (2014) established the consistency of boosted non-parametric conditional transformation models (see Subsection 4.2). For sieve maximum likelihood estimation in the class of conditional transformation models, the techniques employed by McLain and Ghosh (2013) require minor technical extensions, which are omitted here.

In summary, the same limiting distribution arises under both the parametric and the non-parametric paradigm for transformation functions parameterised or approximated using Bernstein polynomials, respectively. In the latter case, the target is then the best approximated transformation function with Bernstein polynomials, say (where the index indicates that we use a more complex approximation when increases). If the approximation error is of smaller order than the convergence rate of the estimator, the estimator’s target becomes the true underlying transformation function , and otherwise a bias for estimating remains.

## 4 Applications

The definition of transformation models tailored for specific situations “only” requires the definition of a suitable basis function and a choice of . In this section, we will discuss specific transformation models for unconditional and conditional distributions of ordered and unordered categorical, discrete and continuous responses . Note that the likelihood function allows all these models to be fitted to arbitrarily censored or truncated responses; for brevity, we will not elaborate on the details.

### 4.1 Unconditional Transformation Models

##### Finite Sample Space

For ordered categorical responses from a finite sample space , we assign one parameter to each element of the sample space except . This corresponds to the basis function , where is the unit vector of length , with its th element being one. The transformation function is