# A note on quadratic approximations of logistic log-likelihoods

DANIELE DURANTE    TOMMASO RIGON
###### Abstract

Quadratic approximations of logistic log-likelihoods are fundamental to facilitate estimation and inference for binary variables. Although the classical expansions underlying Newton–Raphson and Fisher scoring methods have attracted much of the interest, there has also been a recent focus on quadratic bounds that uniformly minorize the logistic log-likelihood and are tangent to it at a specific point. Compared to the classical Taylor expansion of the score function, these approximations provide iterative estimation procedures which guarantee monotonicity of the log-likelihood sequence, and motivate variational methods for Bayesian inference. A relevant contribution, within this class of approximations, relies on a convex duality argument to derive a tractable family of tangent quadratic expansions indexed by a location parameter. Although this approximation is widely used in practice, fewer attempts have been made to understand its probabilistic justification and the associated properties. To address this gap, we formally relate this quadratic lower bound to a recent Pólya-gamma data augmentation, showing that the approximation error associated with the bound coincides with the Kullback–Leibler divergence between a generic Pólya-gamma variable and the one obtained by conditioning on the observed response data. This result facilitates the study of the optimality properties associated with the minorize-majorize and variational Bayes routines leveraging this quadratic bound, and motivates a novel mean-field variational Bayes algorithm for logistic regression.

Some key words: Expectation-maximization algorithm; Logistic regression; Minorize-majorize algorithm; Pólya-gamma data augmentation; Quadratic approximation; Variational Bayes.

## 1. Introduction

Regression models for binary data are ubiquitous in statistics, due to their major role in inference and prediction for dichotomous responses (e.g. Agresti, 2013). Within this class of models, logistic regression provides a highly popular formulation which is formally related to the generalized linear models framework and has a natural exponential family representation for the conditional distribution of the Bernoulli data given a set of covariates (Nelder & Wedderburn, 1972).

Although the direct connection with the exponential families provides a set of well-established methods for estimation and inference, there is still active research within this class of models to address other key open questions. For instance, the second-order Taylor expansion of the logistic log-likelihood employed by classical Newton–Raphson routines for maximum likelihood estimation (e.g. Nelder & Wedderburn, 1972) does not guarantee monotone log-likelihood sequences, thus potentially affecting the stability of the maximization routine in particular scenarios (Böhning & Lindsay, 1988). This has motivated other computational methods leveraging alternative quadratic approximations which uniformly minorize the logistic log-likelihood and are tangent to it at a given point (Böhning & Lindsay, 1988; De Leeuw & Lange, 2009; Browne et al., 2015), thus guaranteeing monotone log-likelihood sequences (Hunter & Lange, 2004).

Besides facilitating maximum likelihood estimation under a frequentist perspective, the above quadratic lower bounds additionally provide fundamental representations under a Bayesian paradigm, especially in approximate variational inference (e.g. Bishop, 2006, Chapter 10.1). Indeed, tractable lower bounds for the predictive log-density are key to deriving simple methods which can accurately approximate the posterior distribution of the model parameters (e.g. Blei et al., 2017). To address this goal, a seminal contribution by Jaakkola & Jordan (2000) proposes a quadratic lower bound for the logistic log-likelihood, which exploits a convexity argument for the logistic function to obtain the bound

$$\mathcal{L}_{\xi_i}(\beta; y_i, x_i) = (y_i - 0{\cdot}5)x_i^{\mathrm{T}}\beta - 0{\cdot}5\xi_i - 0{\cdot}25\,\xi_i^{-1}\tanh(0{\cdot}5\xi_i)\{(x_i^{\mathrm{T}}\beta)^2 - \xi_i^2\} - \log\{1 + \exp(-\xi_i)\}, \tag{1}$$

for the log-likelihood $\ell(\beta; y_i, x_i)$ of each observation $y_i \in \{0, 1\}$, $i = 1, \ldots, n$, from a logistic regression. In (1), $x_i = (x_{i1}, \ldots, x_{ip})^{\mathrm{T}}$ denotes the vector of covariates measured for each unit $i$, whereas $\beta = (\beta_1, \ldots, \beta_p)^{\mathrm{T}}$ are the associated coefficients. The quantities $\xi_i$ denote instead unit-specific variational parameters defining the location where (1) is tangent to $\ell(\beta; y_i, x_i)$. In fact, when $\xi_i = x_i^{\mathrm{T}}\beta$, (1) coincides with $\ell(\beta; y_i, x_i)$. Jaakkola & Jordan (2000) adjust these parameters within an iterative procedure to approximate $\ell(\beta; y_i, x_i)$ via $\mathcal{L}_{\xi_i}(\beta; y_i, x_i)$, while preserving the quadratic form of (1), which allows conjugate updating for $\beta$ under a Gaussian prior.
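The tangency and minorization properties of this bound are easy to check numerically. The sketch below is our illustration, not part of the original paper; the function names are ours, and the tangency point is an arbitrary choice.

```python
import numpy as np

def log_lik(psi, y):
    # Logistic log-likelihood as a function of the linear predictor psi = x_i^T beta
    return y * psi - np.log1p(np.exp(psi))

def jj_bound(psi, y, xi):
    # Jaakkola-Jordan quadratic lower bound, tangent to the log-likelihood at psi = xi
    lam = np.tanh(0.5 * xi) / (4.0 * xi)   # equals 0.25 * xi^{-1} * tanh(0.5 * xi)
    return (y - 0.5) * psi - 0.5 * xi - lam * (psi ** 2 - xi ** 2) - np.log1p(np.exp(-xi))

psi = np.linspace(-6.0, 6.0, 241)          # grid of linear predictors
xi = 2.0                                   # arbitrary tangency point
gap = log_lik(psi, 1.0) - jj_bound(psi, 1.0, xi)   # non-negative, zero at psi = xi
```

The gap is non-negative on the whole grid and vanishes at the tangency point, for both response values.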

Although (1) has been studied from a mathematical perspective (Jaakkola & Jordan, 2000; De Leeuw & Lange, 2009; Browne et al., 2015), and provides a popular expansion in Bayesian inference and machine learning (e.g. Blei et al., 2017), it still lacks a probabilistic motivation. Recalling Bishop (2006, Chapters 9.4 and 10.1), computational methods characterized by a constructive probabilistic representation, such as the expectation-maximization algorithm and the classical mean-field variational Bayes, have a direct justification, a clear interpretation, and fall within a general methodological framework provided with optimality properties. As discussed in Section 2, the bound proposed by Jaakkola & Jordan (2000), although apparently supported by purely mathematical arguments, has indeed a clear probabilistic interpretation.

## 2. A probabilistic interpretation of the bound

This section discusses the theoretical connection between (1) and a recent Pólya-gamma data augmentation for conjugate inference in Bayesian logistic regression (Polson et al., 2013). As shown in Theorem 1, these two representations are closely related, thereby providing a novel probabilistic interpretation of (1).

###### Theorem 1.

Let $\mathcal{L}_{\xi_i}(\beta; y_i, x_i)$ denote the quadratic lower bound in (1) developed by Jaakkola & Jordan (2000) for the logistic log-likelihood $\ell(\beta; y_i, x_i)$ of each unit $i = 1, \ldots, n$. Then

$$\ell(\beta; y_i, x_i) - \mathcal{L}_{\xi_i}(\beta; y_i, x_i) = \mathrm{KL}\{f(\omega_i; \xi_i)\,||\,f(\omega_i; x_i^{\mathrm{T}}\beta)\}, \quad (i = 1, \ldots, n), \tag{2}$$

with $\mathrm{KL}\{f(\omega_i; \xi_i)\,||\,f(\omega_i; x_i^{\mathrm{T}}\beta)\}$ denoting the Kullback–Leibler divergence between the density functions $f(\omega_i; \xi_i)$ and $f(\omega_i; x_i^{\mathrm{T}}\beta)$ of the Pólya-gamma random variables $\omega_i \sim \mathrm{PG}(1, \xi_i)$ and $\omega_i \sim \mathrm{PG}(1, x_i^{\mathrm{T}}\beta)$.

The proof of Theorem 1 exploits results from the Pólya-gamma data augmentation (Polson et al., 2013). In particular, the core contribution of Polson et al. (2013) is in showing that the logistic probability mass function $p(y_i; x_i^{\mathrm{T}}\beta) = \exp(x_i^{\mathrm{T}}\beta)^{y_i}\{1 + \exp(x_i^{\mathrm{T}}\beta)\}^{-1}$ can be obtained by marginalizing out the Pólya-gamma latent variable $\omega_i \sim \mathrm{PG}(1, 0)$ in the joint distribution $p(y_i, \omega_i; x_i^{\mathrm{T}}\beta)$, with $f(\omega_i; 0)$ characterizing the density function of $\omega_i \sim \mathrm{PG}(1, 0)$. Their result follows easily after noticing that $p(y_i; x_i^{\mathrm{T}}\beta) = 0{\cdot}5\exp\{(y_i - 0{\cdot}5)x_i^{\mathrm{T}}\beta\}\int_{\mathbb{R}^{+}}\exp\{-0{\cdot}5\,\omega_i(x_i^{\mathrm{T}}\beta)^2\}f(\omega_i; 0)\,d\omega_i$ and $f(\omega_i; c) = \cosh(0{\cdot}5c)\exp(-0{\cdot}5\,\omega_i c^2)f(\omega_i; 0)$.
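The marginalization identity can be checked by Monte Carlo, sampling $\mathrm{PG}(1, 0)$ through its infinite-convolution-of-gammas representation in Polson et al. (2013), truncated at $K$ terms. This sketch is ours; the truncation introduces a small, controllable bias, and all tolerances are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws, K = 50_000, 200

# PG(1, 0) draws via the truncated series omega = (1 / (2 pi^2)) sum_k g_k / (k - 1/2)^2,
# with g_k ~ Exponential(1); the neglected tail has mean of order 1 / (2 pi^2 K)
omega = np.zeros(n_draws)
for k in range(1, K + 1):
    omega += rng.exponential(size=n_draws) / (k - 0.5) ** 2
omega /= 2.0 * np.pi ** 2

psi, y = 1.5, 1.0
logistic_pmf = np.exp(y * psi) / (1.0 + np.exp(psi))
# Marginalization identity: p(y; psi) = 0.5 exp{(y - 0.5) psi} E[exp(-0.5 omega psi^2)]
mc_estimate = 0.5 * np.exp((y - 0.5) * psi) * np.mean(np.exp(-0.5 * omega * psi ** 2))
```

The Monte Carlo estimate matches the logistic mass function, and the sample mean of $\omega$ agrees with the known value $E(\omega) = 0{\cdot}25$ under $\mathrm{PG}(1, 0)$.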

The above results suggest a data augmentation scheme which facilitates the implementation of a Gibbs sampler for conjugate Bayesian inference under Gaussian priors for the coefficients $\beta$. In fact, as shown in Polson et al. (2013), the joint distribution of the data and the Pólya-gamma augmented variables has a Gaussian kernel in $\beta$. Scott & Sun (2013) also proposed an expectation-maximization routine for maximum a posteriori estimation, discussing direct connections with the variational Bayes of Jaakkola & Jordan (2000). Their findings are, however, limited to computational differences and similarities between the two methods. We instead provide a fully generative connection between the contribution by Jaakkola & Jordan (2000) and the one from Polson et al. (2013).

###### Proof (of Theorem 1).

To prove Theorem 1, first notice that $\ell(\beta; y_i, x_i) = (y_i - 0{\cdot}5)x_i^{\mathrm{T}}\beta - \log[2\cosh\{0{\cdot}5(x_i^{\mathrm{T}}\beta)\}]$ and $\log\{1 + \exp(-\xi_i)\} = -0{\cdot}5\xi_i + \log\{2\cosh(0{\cdot}5\xi_i)\}$. Replacing such quantities in $\ell(\beta; y_i, x_i)$ and $\mathcal{L}_{\xi_i}(\beta; y_i, x_i)$, respectively, we obtain

$$\ell(\beta; y_i, x_i) - \mathcal{L}_{\xi_i}(\beta; y_i, x_i) = 0{\cdot}25\,\xi_i^{-1}\tanh(0{\cdot}5\xi_i)\{(x_i^{\mathrm{T}}\beta)^2 - \xi_i^2\} + \log[\cosh(0{\cdot}5\xi_i)\cosh\{0{\cdot}5(x_i^{\mathrm{T}}\beta)\}^{-1}].$$

To highlight the Kullback–Leibler divergence in the above equation, note that, adapting results in Polson et al. (2013), the quantity $0{\cdot}25\,\xi_i^{-1}\tanh(0{\cdot}5\xi_i)$ can easily be re-expressed as $0{\cdot}5E(\omega_i)$, with the expectation taken with respect to a Pólya-gamma $\omega_i \sim \mathrm{PG}(1, \xi_i)$. Hence

$$\ell(\beta; y_i, x_i) - \mathcal{L}_{\xi_i}(\beta; y_i, x_i) = \int_{\mathbb{R}^{+}} f(\omega_i; \xi_i)\log\frac{\exp(-0{\cdot}5\,\omega_i\xi_i^2)\cosh(0{\cdot}5\xi_i)f(\omega_i; 0)}{\exp\{-0{\cdot}5\,\omega_i(x_i^{\mathrm{T}}\beta)^2\}\cosh\{0{\cdot}5(x_i^{\mathrm{T}}\beta)\}f(\omega_i; 0)}\,d\omega_i.$$

Based on the above expression, the proof is concluded after noticing that the quantities at the numerator and the denominator coincide with the density functions $f(\omega_i; \xi_i)$ and $f(\omega_i; x_i^{\mathrm{T}}\beta)$ of the Pólya-gamma variables $\mathrm{PG}(1, \xi_i)$ and $\mathrm{PG}(1, x_i^{\mathrm{T}}\beta)$, respectively (Polson et al., 2013). Therefore, $\ell(\beta; y_i, x_i) - \mathcal{L}_{\xi_i}(\beta; y_i, x_i) = \mathrm{KL}\{f(\omega_i; \xi_i)\,||\,f(\omega_i; x_i^{\mathrm{T}}\beta)\}$.

Note that, based on the Pólya-gamma data augmentation scheme, the joint density $p(y_i, \omega_i; x_i^{\mathrm{T}}\beta)$ factorizes as $p(y_i; x_i^{\mathrm{T}}\beta)f(\omega_i \mid y_i; x_i^{\mathrm{T}}\beta)$, with $f(\omega_i \mid y_i; x_i^{\mathrm{T}}\beta) = f(\omega_i; x_i^{\mathrm{T}}\beta)$ (Polson et al., 2013), thereby allowing (2) to be interpreted as the Kullback–Leibler divergence between a generic Pólya-gamma variable and the one obtained by conditioning on the observed data $y_i$. According to Corollary 1, this result allows (1) to be interpreted as the evidence lower bound of $\ell(\beta; y_i, x_i)$ under a Pólya-gamma data augmentation.
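Theorem 1 can also be verified numerically: the gap $\ell - \mathcal{L}_{\xi_i}$, computed directly from the two likelihood expressions, matches a closed-form Kullback–Leibler divergence obtained (our derivation) from the tilting identity $f(\omega; c) = \cosh(0{\cdot}5c)\exp(-0{\cdot}5\omega c^2)f(\omega; 0)$ and $E(\omega) = 0{\cdot}5c^{-1}\tanh(0{\cdot}5c)$ under $\mathrm{PG}(1, c)$. Function names and test values are ours.

```python
import numpy as np

def log_lik(psi, y):
    return y * psi - np.log1p(np.exp(psi))

def jj_bound(psi, y, xi):
    lam = np.tanh(0.5 * xi) / (4.0 * xi)
    return (y - 0.5) * psi - 0.5 * xi - lam * (psi ** 2 - xi ** 2) - np.log1p(np.exp(-xi))

def kl_pg(xi, c):
    # KL{PG(1, xi) || PG(1, c)} from the exponential-tilting form of the PG density
    e_omega = np.tanh(0.5 * xi) / (2.0 * xi)     # E(omega) under PG(1, xi)
    return (np.log(np.cosh(0.5 * xi)) - np.log(np.cosh(0.5 * c))
            + 0.5 * (c ** 2 - xi ** 2) * e_omega)

xi, c = 0.7, 2.3                                 # tangency point and linear predictor
gap = log_lik(c, 1.0) - jj_bound(c, 1.0, xi)     # approximation error of the bound
```

As the theorem states, the gap does not depend on the observed $y_i$, is strictly positive away from the tangency point, and vanishes when the two Pólya-gamma tilting parameters coincide.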

###### Corollary 1.

The quadratic lower bound (1) derived by Jaakkola & Jordan (2000) can be expressed as $\mathcal{L}_{\xi_i}(\beta; y_i, x_i) = \int_{\mathbb{R}^{+}} f(\omega_i; \xi_i)\log[p(y_i, \omega_i; x_i^{\mathrm{T}}\beta)/f(\omega_i; \xi_i)]\,d\omega_i = \mathcal{L}\{f(\omega_i; \xi_i), \beta\}$.

###### Proof (of Corollary 1).

The proof of Corollary 1 follows easily after applying the well-known equality $\ell(\beta; y_i, x_i) = \mathcal{L}\{f(\omega_i; \xi_i), \beta\} + \mathrm{KL}\{f(\omega_i; \xi_i)\,||\,f(\omega_i \mid y_i; x_i^{\mathrm{T}}\beta)\}$. In fact, based on results in Theorem 1 and on the equality $f(\omega_i \mid y_i; x_i^{\mathrm{T}}\beta) = f(\omega_i; x_i^{\mathrm{T}}\beta)$, we have $\mathcal{L}_{\xi_i}(\beta; y_i, x_i) = \ell(\beta; y_i, x_i) - \mathrm{KL}\{f(\omega_i; \xi_i)\,||\,f(\omega_i \mid y_i; x_i^{\mathrm{T}}\beta)\} = \mathcal{L}\{f(\omega_i; \xi_i), \beta\}$.

Theorem 1 and Corollary 1 provide a probabilistic interpretation for the convex duality argument underlying the lower bound developed by Jaakkola & Jordan (2000). Indeed, according to the above results, their expansion is a proper evidence lower bound for the logistic log-likelihood, induced by a latent variable representation via Pólya-gamma augmented data, which provides a quadratic complete-data log-likelihood $\log\{p(y_i, \omega_i; x_i^{\mathrm{T}}\beta)\}$. Besides this result, Theorem 1 additionally provides a formal characterization of the approximation error $\ell(\beta; y_i, x_i) - \mathcal{L}_{\xi_i}(\beta; y_i, x_i)$. Indeed, this quantity is the Kullback–Leibler divergence between a generic Pólya-gamma random variable and the one obtained by conditioning on the observed data. This allows their inequality $\ell(\beta; y_i, x_i) \geq \mathcal{L}_{\xi_i}(\beta; y_i, x_i)$ to be completed, obtaining

$$\ell(\beta; y_i, x_i) = \int_{\mathbb{R}^{+}} f(\omega_i; \xi_i)\log\frac{p(y_i; x_i^{\mathrm{T}}\beta)f(\omega_i; x_i^{\mathrm{T}}\beta)}{f(\omega_i; \xi_i)}\,d\omega_i + \int_{\mathbb{R}^{+}} f(\omega_i; \xi_i)\log\frac{f(\omega_i; \xi_i)}{f(\omega_i; x_i^{\mathrm{T}}\beta)}\,d\omega_i \tag{3}$$
$$= \mathcal{L}\{f(\omega_i; \xi_i), \beta\} + \mathrm{KL}\{f(\omega_i; \xi_i)\,||\,f(\omega_i \mid y_i; x_i^{\mathrm{T}}\beta)\}, \quad (i = 1, \ldots, n).$$

Equation (3) has direct connections with maximum likelihood estimation via expectation-maximization algorithms, and with Bayesian variational inference (e.g. Bishop, 2006, Chapters 9.4 and 10.1). Leveraging this result, Section 3 shows that the maximum likelihood estimation routine proposed by Jaakkola & Jordan (2000) coincides with an expectation-maximization algorithm via Pólya-gamma augmented data, which has optimal properties compared to current routines characterized by monotone log-likelihood sequences. Section 4 highlights, instead, that the variational Bayes of Jaakkola & Jordan (2000), although apparently local, is indeed a global method which minimizes the Kullback–Leibler divergence between a mean-field variational approximation based on Pólya-gamma variables and the joint posterior distribution of the coefficients $\beta$ and the augmented data $\omega = (\omega_1, \ldots, \omega_n)^{\mathrm{T}}$. This novel global algorithm is also analytically derived.

## 3. Maximum likelihood estimation via minorize-majorize algorithms

Leveraging the quadratic approximation $\mathcal{L}_{\xi_i}(\beta; y_i, x_i)$, Jaakkola & Jordan (2000) propose an iterative routine for maximum likelihood estimation that has a monotone log-likelihood sequence and simple maximization. In particular, letting $\beta^{(t)}$ denote the estimated value of the coefficients at step $t$, and simplifying the calculations in Appendix C of Jaakkola & Jordan (2000), their routine first maximizes (1) with respect to $\xi_i$, obtaining $\xi_i^{(t)} = x_i^{\mathrm{T}}\beta^{(t)}$ for each $i = 1, \ldots, n$, and then derives $\beta^{(t+1)}$ by maximizing $\sum_{i=1}^{n}\mathcal{L}_{\xi_i^{(t)}}(\beta; y_i, x_i)$. This last step is straightforward due to the quadratic form of the lower bound, thus providing

$$\beta^{(t+1)} = (X^{\mathrm{T}}\Omega^{(t)}X)^{-1}X^{\mathrm{T}}(y_1 - 0{\cdot}5, \ldots, y_n - 0{\cdot}5)^{\mathrm{T}} = (X^{\mathrm{T}}\Omega^{(t)}X)^{-1}X^{\mathrm{T}}(y - 0{\cdot}5\,1_n), \tag{4}$$

with $\Omega^{(t)}$ the $n \times n$ diagonal matrix with entries $\Omega^{(t)}_{ii} = \omega_i^{(t)} = 0{\cdot}5(x_i^{\mathrm{T}}\beta^{(t)})^{-1}\tanh\{0{\cdot}5(x_i^{\mathrm{T}}\beta^{(t)})\}$, $X$ the $n \times p$ design matrix with rows $x_i^{\mathrm{T}}$, and $y = (y_1, \ldots, y_n)^{\mathrm{T}}$.

Analyzing the above two-step maximization routine in the light of equation (3), it can be easily noticed that $\xi_i^{(t)} = x_i^{\mathrm{T}}\beta^{(t)}$, $i = 1, \ldots, n$, leads to the solution minimizing the Kullback–Leibler divergence in (3), whereas the function

$$\bar{\mathcal{L}}_{\xi}(\beta; y, x) = \sum_{i=1}^{n}\mathcal{L}\{f(\omega_i; x_i^{\mathrm{T}}\beta^{(t)}), \beta\} = \sum_{i=1}^{n}\int_{\mathbb{R}^{+}} f(\omega_i; x_i^{\mathrm{T}}\beta^{(t)})\log\frac{p(y_i; x_i^{\mathrm{T}}\beta)f(\omega_i; x_i^{\mathrm{T}}\beta)}{f(\omega_i; x_i^{\mathrm{T}}\beta^{(t)})}\,d\omega_i,$$

maximized with respect to $\beta$ is equal, up to an additive constant, to the expected value of the complete-data log-likelihood $\sum_{i=1}^{n}\log\{p(y_i, \omega_i; x_i^{\mathrm{T}}\beta)\}$, computed with respect to the conditional distribution of the Pólya-gamma augmented data $\omega_i \sim \mathrm{PG}(1, x_i^{\mathrm{T}}\beta^{(t)})$, for $i = 1, \ldots, n$. Combining these results with the key rationale underlying the expectation-maximization algorithm (e.g. Bishop, 2006, Chapter 9.4), it easily follows that the maximum likelihood estimation routine proposed by Jaakkola & Jordan (2000) is indeed an expectation-maximization algorithm based on Pólya-gamma augmented data. According to Algorithm 1, this routine first computes the expectation of the complete-data log-likelihood, and then maximizes it with respect to $\beta$.
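The two steps just described can be sketched in a few lines. This is our illustration on simulated data, not the authors' code; variable names and the simulation design are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([0.5, -1.0, 1.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

def loglik(beta):
    psi = X @ beta
    return np.sum(y * psi - np.log1p(np.exp(psi)))

beta = np.zeros(p)
ll_trace = [loglik(beta)]
for _ in range(200):
    psi = X @ beta
    # E-step: omega_i = E(omega_i) under PG(1, x_i^T beta), with limit 0.25 at psi = 0
    w = np.full(n, 0.25)
    nz = psi != 0
    w[nz] = np.tanh(0.5 * psi[nz]) / (2.0 * psi[nz])
    # M-step: weighted least-squares solve of the quadratic lower bound
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - 0.5))
    ll_trace.append(loglik(beta))
ll_trace = np.array(ll_trace)
grad = X.T @ (y - 1.0 / (1.0 + np.exp(-X @ beta)))   # score at the final iterate
```

The recorded log-likelihood sequence is non-decreasing, and the score at the final iterate is numerically zero, as expected of an expectation-maximization routine.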

As already discussed in Jaakkola & Jordan (2000), the above maximization method guarantees a monotone log-likelihood sequence $\ell(\beta^{(t+1)}; y, x) \geq \ell(\beta^{(t)}; y, x)$, thus ensuring a stable convergence. Indeed, leveraging equation (3), it can be easily noticed that the above routine is a genuine minorize-majorize algorithm (e.g. Hunter & Lange, 2004), provided that $\bar{\mathcal{L}}_{\xi}(\beta; y, x) \leq \ell(\beta; y, x)$ for each $\beta$, with equality at $\beta = \beta^{(t)}$. In fact, $\mathrm{KL}\{f(\omega_i; x_i^{\mathrm{T}}\beta^{(t)})\,||\,f(\omega_i; x_i^{\mathrm{T}}\beta)\} = 0$ when $\beta = \beta^{(t)}$. We shall notice that also De Leeuw & Lange (2009) and Browne et al. (2015) highlighted this relation with minorize-majorize algorithms under a mathematical argument, while discussing the sharpness of the bound in (1). Leveraging the results in Section 2, we additionally show that the minorize-majorize algorithm based on (1) is more efficient than the one in Böhning & Lindsay (1988), which, to our knowledge, is the only tractable minorize-majorize alternative to the one of Jaakkola & Jordan (2000).

To address the above goal, let us first re-write (4) to allow a direct comparison with the solution

$$\beta^{(t+1)} = \beta^{(t)} + (X^{\mathrm{T}}\Gamma X)^{-1}X^{\mathrm{T}}(y_1 - \pi_1^{(t)}, \ldots, y_n - \pi_n^{(t)})^{\mathrm{T}}, \tag{5}$$

from Böhning & Lindsay (1988), with $\pi_i^{(t)} = \{1 + \exp(-x_i^{\mathrm{T}}\beta^{(t)})\}^{-1}$ and $\Gamma = 0{\cdot}25\,I_n$, where $I_n$ denotes the $n \times n$ identity matrix. Indeed, a similar mapping can also be derived by adding and subtracting $X^{\mathrm{T}}\Omega^{(t)}X\beta^{(t)}$ inside the term $X^{\mathrm{T}}(y - 0{\cdot}5\,1_n)$ in (4), to obtain the solution

$$\beta^{(t+1)} = \beta^{(t)} + (X^{\mathrm{T}}\Omega^{(t)}X)^{-1}X^{\mathrm{T}}(y_1 - \pi_1^{(t)}, \ldots, y_n - \pi_n^{(t)})^{\mathrm{T}}, \tag{6}$$

after noticing that each entry in $y - 0{\cdot}5\,1_n - \Omega^{(t)}X\beta^{(t)}$ can be written as

$$y_i - 0{\cdot}5[1 + \{1 - \exp(-x_i^{\mathrm{T}}\beta^{(t)})\}\{1 + \exp(-x_i^{\mathrm{T}}\beta^{(t)})\}^{-1}] = y_i - \{1 + \exp(-x_i^{\mathrm{T}}\beta^{(t)})\}^{-1} = y_i - \pi_i^{(t)}.$$
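The two fixed-point maps can now be run side by side. The sketch below is our illustration on simulated data: both recursions target the same maximum likelihood solution, differing only in the weight matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 150, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ np.array([0.3, -0.8, 1.2])))).astype(float)

def pi_vec(beta):
    return 1.0 / (1.0 + np.exp(-X @ beta))

def update_bl(beta):
    # Bohning-Lindsay map: fixed weight matrix Gamma = 0.25 I_n
    return beta + np.linalg.solve(0.25 * X.T @ X, X.T @ (y - pi_vec(beta)))

def update_jj(beta):
    # Jaakkola-Jordan map: Polya-gamma expected weights, with limit 0.25 at psi = 0
    psi = X @ beta
    w = np.full(n, 0.25)
    nz = psi != 0
    w[nz] = np.tanh(0.5 * psi[nz]) / (2.0 * psi[nz])
    return beta + np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - pi_vec(beta)))

beta_bl = np.zeros(p)
beta_jj = np.zeros(p)
for _ in range(500):
    beta_bl = update_bl(beta_bl)
    beta_jj = update_jj(beta_jj)
```

After enough iterations the two sequences agree, and the common limit solves the score equations.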

###### Proposition 1.

Assume that $\hat{\beta}$ is the limit, if it exists, of the sequence $\{\beta^{(t)} : t \geq 1\}$, and let $M_{\mathrm{JJ}}(\beta^{(t)})$ and $M_{\mathrm{BL}}(\beta^{(t)})$ denote the functions mapping $\beta^{(t)}$ to $\beta^{(t+1)}$ in (6) and (5), respectively. Then $\lambda_{\mathrm{JJ}} \leq \lambda_{\mathrm{BL}}$, with $\lambda_{\mathrm{JJ}}$ and $\lambda_{\mathrm{BL}}$ being the maximum eigenvalues of the Jacobian matrices $M'_{\mathrm{JJ}}(\beta)$ and $M'_{\mathrm{BL}}(\beta)$, respectively, computed at $\hat{\beta}$.

To prove the claim in Proposition 1, note that $M'_{\mathrm{BL}}(\hat{\beta})$ can be easily computed as in Böhning & Lindsay (1988), exploiting the fact that $\Gamma$ does not depend on $\beta$ in (5). It is instead not immediate to calculate $M'_{\mathrm{JJ}}(\hat{\beta})$ via direct differentiation of $M_{\mathrm{JJ}}(\beta)$, since (6) contains more complex hyperbolic transformations of $\beta$. However, exploiting the probabilistic findings discussed in Section 2, this issue can be easily circumvented by leveraging the expectation-maximization interpretation of the routine of Jaakkola & Jordan (2000) via Pólya-gamma augmented data. Indeed, following McLachlan & Krishnan (2007, Chapter 3.9.3), the rate matrix of an iterative routine based on expectation-maximization methods is equal to $I_p$ minus the inverse of the expected complete-data information matrix, with the expectation taken with respect to the augmented data, multiplied by the observed-data information matrix. This quantity can be easily computed in our case, provided that the complete-data log-likelihood is equal, up to an additive constant, to the quadratic function $\sum_{i=1}^{n}\{(y_i - 0{\cdot}5)x_i^{\mathrm{T}}\beta - 0{\cdot}5\,\omega_i(x_i^{\mathrm{T}}\beta)^2\}$ of $\beta$, which is also linear in the augmented Pólya-gamma data $\omega_i$. Due to this, it is easy to show that $M'_{\mathrm{JJ}}(\hat{\beta}) = I_p - (X^{\mathrm{T}}\hat{\Omega}X)^{-1}X^{\mathrm{T}}\hat{V}X$, where $\hat{\Omega}$ is the diagonal matrix with entries $\hat{\omega}_i = 0{\cdot}5(x_i^{\mathrm{T}}\hat{\beta})^{-1}\tanh\{0{\cdot}5(x_i^{\mathrm{T}}\hat{\beta})\}$, and $\hat{V}$ the diagonal matrix with entries $\hat{\pi}_i(1 - \hat{\pi}_i)$ defining the observed-data information $X^{\mathrm{T}}\hat{V}X$.

###### Proof (of Proposition 1).

Recalling the above discussion, the proof of Proposition 1 requires comparing the maximum eigenvalues of $M'_{\mathrm{JJ}}(\hat{\beta}) = I_p - (X^{\mathrm{T}}\hat{\Omega}X)^{-1}X^{\mathrm{T}}\hat{V}X$ and $M'_{\mathrm{BL}}(\hat{\beta}) = I_p - (X^{\mathrm{T}}\Gamma X)^{-1}X^{\mathrm{T}}\hat{V}X$. To address this goal, let us first show that $X^{\mathrm{T}}\Gamma X \succeq X^{\mathrm{T}}\hat{\Omega}X$. In particular, since the matrices $X^{\mathrm{T}}\hat{\Omega}X$, $X^{\mathrm{T}}\Gamma X$ and $X^{\mathrm{T}}\hat{V}X$ are positive definite, $\lambda_{\mathrm{JJ}} \leq \lambda_{\mathrm{BL}}$ holds if $X^{\mathrm{T}}\Gamma X - X^{\mathrm{T}}\hat{\Omega}X$ is positive semidefinite. Proving this inequality coincides with showing that $X^{\mathrm{T}}(\Gamma - \hat{\Omega})X$ is positive semidefinite. Since the Hermitian matrix $X^{\mathrm{T}}(\Gamma - \hat{\Omega})X$ is a standard quadratic form, this result holds if all the entries in the diagonal matrix $\Gamma - \hat{\Omega}$ are non-negative.

Letting $x = 0{\cdot}5\,x_i^{\mathrm{T}}\hat{\beta}$, and re-writing the well-established inequality $\tanh(x) \leq x$, $x \geq 0$ (e.g. Zhu, 2012), as $0{\cdot}25\,x^{-1}\tanh(x) \leq 0{\cdot}25$, it easily follows that $\hat{\omega}_i = 0{\cdot}5(x_i^{\mathrm{T}}\hat{\beta})^{-1}\tanh\{0{\cdot}5(x_i^{\mathrm{T}}\hat{\beta})\} \leq 0{\cdot}25$, where $\hat{\omega}_i$ is an even function of $x_i^{\mathrm{T}}\hat{\beta}$, thus guaranteeing $\Gamma - \hat{\Omega} \succeq 0$ and, as a direct consequence, $X^{\mathrm{T}}\Gamma X \succeq X^{\mathrm{T}}\hat{\Omega}X$. This result concludes the proof. In fact, $X^{\mathrm{T}}\Gamma X \succeq X^{\mathrm{T}}\hat{\Omega}X$ implies that $\lambda_{\mathrm{JJ}} \leq \lambda_{\mathrm{BL}}$ (e.g. Knutson & Tao, 2001).

Proposition 1 guarantees that the updating scheme (6) uniformly improves the convergence rate of (5). To understand this result, note that high values of the maximum eigenvalue of the rate matrix imply slow convergence. We shall however emphasize that the expectation-maximization routine underlying the quadratic approximation of Jaakkola & Jordan (2000) still preserves a linear convergence rate, provided that $\lambda_{\mathrm{JJ}} > 0$. Hence, (6) and (5) do not reach the quadratic convergence rate of the Newton–Raphson scheme, but have the benefit of guaranteeing monotone log-likelihood sequences, while coinciding with the Newton–Raphson update when $x_i^{\mathrm{T}}\beta^{(t)} = 0$ for every $i = 1, \ldots, n$. In this respect, note that although $\omega_i^{(t)} = 0{\cdot}5(x_i^{\mathrm{T}}\beta^{(t)})^{-1}\tanh\{0{\cdot}5(x_i^{\mathrm{T}}\beta^{(t)})\}$ is not defined for $x_i^{\mathrm{T}}\beta^{(t)} = 0$, the expectation of the Pólya-gamma $\mathrm{PG}(1, 0)$ is still available and coincides with $0{\cdot}25$, thus making $\Omega^{(t)}$, $\Gamma$ and the Newton–Raphson weight matrix all equal when $x_i^{\mathrm{T}}\beta^{(t)} = 0$, $i = 1, \ldots, n$. It is also important to highlight that, although the minorize-majorize routine of Böhning & Lindsay (1988) has slower convergence, the matrix $\Gamma$ in (5) does not depend on $\beta^{(t)}$, thus requiring the inversion of $X^{\mathrm{T}}\Gamma X$ only once during the iterative procedure. This may lead to a relevant reduction in computational complexity, especially in high-dimensional problems, compared to the updating in (6), which requires, instead, inversion of $X^{\mathrm{T}}\Omega^{(t)}X$ at each iteration.
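Proposition 1 can be illustrated numerically by evaluating both rate matrices at the common limit. The sketch below is ours: it reaches the maximum likelihood solution by Newton–Raphson on simulated data, then compares the maximum eigenvalues of the two rate matrices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 150, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ np.array([0.3, -0.8, 1.2])))).astype(float)

# Reach the common limit beta_hat with Newton-Raphson
beta_hat = np.zeros(p)
for _ in range(50):
    pi = 1.0 / (1.0 + np.exp(-X @ beta_hat))
    V = pi * (1.0 - pi)
    beta_hat += np.linalg.solve(X.T @ (V[:, None] * X), X.T @ (y - pi))

pi = 1.0 / (1.0 + np.exp(-X @ beta_hat))
V = pi * (1.0 - pi)                                 # observed-information weights
psi = X @ beta_hat
w = np.tanh(0.5 * psi) / (2.0 * psi)                # Polya-gamma weights omega_hat
I_obs = X.T @ (V[:, None] * X)                      # observed-data information

M_jj = np.eye(p) - np.linalg.solve(X.T @ (w[:, None] * X), I_obs)
M_bl = np.eye(p) - np.linalg.solve(0.25 * X.T @ X, I_obs)
lam_jj = np.max(np.linalg.eigvals(M_jj).real)
lam_bl = np.max(np.linalg.eigvals(M_bl).real)
```

Both rates fall in the unit interval, and the Jaakkola–Jordan rate is never larger than the Böhning–Lindsay one, in agreement with the proposition.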

###### Remark 1.

Although the above focus has been on maximum likelihood estimation methods, the probabilistic interpretation (3) of the quadratic bound of Jaakkola & Jordan (2000) motivates simple adaptations to include the maximum a posteriori estimation problem under a Bayesian framework. In this case, the goal is to maximize the log-posterior $\log\{f(\beta \mid y; x)\} = \sum_{i=1}^{n}\ell(\beta; y_i, x_i) + \log\{f(\beta)\} - \log\{p(y; x)\}$, where $\log\{f(\beta)\}$ is the log-prior, whereas $\log\{p(y; x)\}$ denotes the log-marginal probability mass function of $y$ obtained by marginalizing out $\beta$. Substituting (3) in $\sum_{i=1}^{n}\ell(\beta; y_i, x_i)$, the log-posterior to be maximized becomes

$$\log\{f(\beta \mid y; x)\} = \sum_{i=1}^{n}[\mathcal{L}\{f(\omega_i; \xi_i), \beta\} + \mathrm{KL}\{f(\omega_i; \xi_i)\,||\,f(\omega_i \mid y_i; x_i^{\mathrm{T}}\beta)\}] + \log\{f(\beta)\} - \log\{p(y; x)\},$$

where the quantity $\log\{p(y; x)\}$ is a normalizing constant not depending on $\beta$. Hence, also maximum a posteriori estimation can be easily performed via minor modifications of the expectation-maximization routine in Algorithm 1. Indeed, since $\omega$ does not appear in the log-prior, the expectation step coincides with the standard one previously discussed, thus providing $\xi_i^{(t)} = x_i^{\mathrm{T}}\beta^{(t)}$ and $\omega_i^{(t)} = 0{\cdot}5(x_i^{\mathrm{T}}\beta^{(t)})^{-1}\tanh\{0{\cdot}5(x_i^{\mathrm{T}}\beta^{(t)})\}$, $i = 1, \ldots, n$. The maximization step requires instead calculating $\beta^{(t+1)} = \arg\max_{\beta}[\bar{\mathcal{L}}_{\xi}(\beta; y, x) + \log\{f(\beta)\}]$. Note that, when $\beta$ has a Gaussian prior, the maximization has an analytic solution, and is closely related to classical regularized maximum likelihood estimation via a ridge-type penalty, thus allowing maximization also in high-dimensional problems (e.g. Hastie et al., 2009, Chapter 3.4). A Laplace prior would lead instead to a standard lasso-type penalty, whose solution can be easily obtained by considering the least angle regression algorithm in the maximization step (e.g. Hastie et al., 2009, Chapter 3.8).
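Under a Gaussian prior the modified maximization step is a ridge-type solve. The sketch below is ours; the zero-mean prior with variance $\tau^2 = 4$ is an arbitrary illustrative choice, and the routine tracks the log-posterior, which is non-decreasing by the minorization argument.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ np.array([0.5, -1.0, 1.0, 0.3])))).astype(float)
tau2 = 4.0                                          # prior variance, illustrative

def log_post(beta):
    # Log-posterior up to the normalizing constant, under a N(0, tau2 I_p) prior
    psi = X @ beta
    return np.sum(y * psi - np.log1p(np.exp(psi))) - beta @ beta / (2.0 * tau2)

beta = np.zeros(p)
lp_trace = [log_post(beta)]
for _ in range(300):
    psi = X @ beta
    # E-step: Polya-gamma expected weights, with limit 0.25 at psi = 0
    w = np.full(n, 0.25)
    nz = psi != 0
    w[nz] = np.tanh(0.5 * psi[nz]) / (2.0 * psi[nz])
    # M-step: ridge-regularized weighted least-squares solve
    beta = np.linalg.solve(X.T @ (w[:, None] * X) + np.eye(p) / tau2, X.T @ (y - 0.5))
    lp_trace.append(log_post(beta))
lp_trace = np.array(lp_trace)
grad = X.T @ (y - 1.0 / (1.0 + np.exp(-X @ beta))) - beta / tau2
```

The log-posterior sequence is monotone, and the penalized score vanishes at the final iterate.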

## 4. Approximate Bayesian inference via variational Bayes

Although the overarching focus of Section 3 has been on maximum likelihood estimation, it should be noticed that the original motivation underlying the lower bound proposed by Jaakkola & Jordan (2000) comes from approximate Bayesian inference. In this setting, the main goal is to approximate the posterior distribution $f(\beta \mid y; x)$ of the model parameters with a suitable density function $f(\beta; \theta)$, which minimizes the Kullback–Leibler divergence $\mathrm{KL}\{f(\beta; \theta)\,||\,f(\beta \mid y; x)\}$. Recalling, for example, Bishop (2006, Chapter 10.1) and Blei et al. (2017), this minimization problem can be alternatively addressed by maximizing the evidence lower bound $\mathcal{L}\{f(\beta; \theta)\}$ of the log-marginal probability mass function $\log\{p(y; x)\}$ with respect to the entire $f(\beta; \theta)$, provided that

$$\log\{p(y; x)\} = \int_{\mathbb{R}^{p}} f(\beta; \theta)\log\frac{f(\beta)\prod_{i=1}^{n}p(y_i; x_i^{\mathrm{T}}\beta)}{f(\beta; \theta)}\,d\beta + \int_{\mathbb{R}^{p}} f(\beta; \theta)\log\frac{f(\beta; \theta)}{f(\beta \mid y; x)}\,d\beta \tag{7}$$
$$= \mathcal{L}\{f(\beta; \theta)\} + \mathrm{KL}\{f(\beta; \theta)\,||\,f(\beta \mid y; x)\},$$

with $f(\beta)$ denoting the prior density of $\beta$. As discussed in Jaakkola & Jordan (2000), when $\beta$ is assigned a Gaussian prior, the maximization of $\mathcal{L}\{f(\beta; \theta)\}$ with respect to $f(\beta; \theta)$ is not straightforward due to the absence of conjugacy between the logistic likelihood $\prod_{i=1}^{n}p(y_i; x_i^{\mathrm{T}}\beta)$ of the observed data and the Gaussian prior for $\beta$. To address this issue, the authors replace the logistic likelihood in (7) with the quadratic bound $\prod_{i=1}^{n}\exp\{\mathcal{L}_{\xi_i}(\beta; y_i, x_i)\}$ in (1), thus obtaining a joint distribution which is proportional to a Gaussian kernel in $\beta$ for every $\xi = (\xi_1, \ldots, \xi_n)^{\mathrm{T}}$. Hence, given $\xi$, the variational distribution maximizing the resulting lower bound is easily available as the multivariate Gaussian whose kernel is proportional to $f(\beta)\prod_{i=1}^{n}\exp\{\mathcal{L}_{\xi_i}(\beta; y_i, x_i)\}$. Refer to equations (10) and (11) in Jaakkola & Jordan (2000) for the solutions of this optimization problem. The maximization with respect to the variational parameters in $\xi$ proceeds instead via an expectation-maximization step exploiting the fact that, given the current estimate of $f(\beta; \theta)$, the lower bound is equal, up to an additive constant, to the expectation of a quadratic complete log-likelihood, with $\omega = (\omega_1, \ldots, \omega_n)^{\mathrm{T}}$ acting as augmented data. This approach provides the simple solution
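The resulting coordinate updates, i.e. the Gaussian solution of equations (10)-(11) in Jaakkola & Jordan (2000) combined with the standard refinement $\xi_i^2 = x_i^{\mathrm{T}}E(\beta\beta^{\mathrm{T}})x_i$ under the current Gaussian approximation, can be sketched as follows. This is our illustration: the simulated data, the zero-mean prior with covariance $4I_p$, and all variable names are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ np.array([0.4, -0.9, 1.1])))).astype(float)

S0_inv = np.eye(p) / 4.0                       # prior precision: N(0, 4 I_p), illustrative
kappa = X.T @ (y - 0.5)

xi = np.ones(n)                                # initial variational parameters
for _ in range(500):
    lam = np.tanh(0.5 * xi) / (4.0 * xi)       # 0.25 xi^{-1} tanh(0.5 xi)
    # Gaussian variational distribution given xi: precision S0^{-1} + 2 sum_i lam_i x_i x_i^T
    S_q = np.linalg.inv(S0_inv + 2.0 * X.T @ (lam[:, None] * X))
    m_q = S_q @ kappa
    # xi update: xi_i^2 = x_i^T E(beta beta^T) x_i under the current Gaussian
    xi_new = np.sqrt(np.einsum('ij,jk,ik->i', X, S_q + np.outer(m_q, m_q), X))
    if np.max(np.abs(xi_new - xi)) < 1e-12:
        xi = xi_new
        break
    xi = xi_new
```

At convergence the variational parameters satisfy the fixed-point condition, and $m_q$ with $S_q$ give the Gaussian approximation to the posterior of $\beta$.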