A note on quadratic approximations of logistic log-likelihoods
Quadratic approximations of logistic log-likelihoods are fundamental to facilitate estimation and inference for binary variables. Although classical expansions underlying Newton–Raphson and Fisher scoring methods have attracted much of the interest, there has been also a recent focus on quadratic bounds that uniformly minorize the logistic log-likelihood, and are tangent to it in a specific point. Compared to the classical Taylor expansion of the score function, these approximations provide iterative estimation procedures which guarantee monotonicity in the log-likelihood sequence, and motivate variational methods for Bayesian inference. A relevant contribution, within this class of approximations, relies on a convex duality argument to derive a tractable family of tangent quadratic expansions indexed by a location parameter. Although this approximation is widely used in practice, less attempts have been made to understand its probabilistic justification and the associated properties. To address this gap, we formally relate this quadratic lower bound to a recent Pólya-gamma data augmentation, showing that the approximation error associated with the bound coincides with the Kullback–Leibler divergence between a generic Pólya-gamma variable and the one obtained by conditioning on the observed response data. This result facilitates the study of the optimality properties associated with the minorize-majorize and variational Bayes routines leveraging this quadratic bound, and motivates a novel mean-field variational Bayes for logistic regression.
Regression models for binary data are ubiquitous in statistics, due to their major role in inference and prediction for dichotomous responses [?]. Within this class of models, logistic regression provides a highly popular formulation which is formally related to the generalized linear models framework and has a natural exponential family representation for the conditional distribution of the Bernoulli data given a set of covariates [?].
Although the direct connection with the exponential families provides a set of well-established methods for estimation and inference, there is a still active research within this class of models to address other key open questions. For instance, the second order Taylor expansion of the logistic log-likelihood employed by classical Newton–Raphson routines for maximum likelihood estimation [?], does not guarantee monotone log-likelihood sequences, thus potentially affecting the stability of the maximization routine in particular scenarios [?]. This has motivated other computational methods leveraging alternative quadratic approximations which uniformly minorize the logistic log-likelihood, and are tangent to it in a given point [?], thus guaranteeing monotone log-likelihood sequences [?].
Besides facilitating maximum likelihood estimation under a frequentist perspective, the above quadratic lower bounds additionally provide fundamental representations under a Bayesian paradigm, especially in approximate variational inference [?]. Indeed, tractable lower bounds for the predictive log-density are a key to derive simple methods which can accurately approximate the posterior distribution of the model parameters [?]. To address this goal, a seminal contribution by [?] proposes a quadratic lower bound for the logistic log-likelihood, which exploits a convexity argument for the logistic function to obtain the bound
for the log-likelihood of each observation , from a logistic regression. In , denote the vector of covariates measured for each unit , whereas are the associated coefficients. The quantities denote instead unit-specific variational parameters defining the location where is tangent to . In fact, when , coincides with . [?] adjust these parameters within an iterative procedure to approximate via , while preserving the quadratic form of , which allows conjugate updating for under a Gaussian prior.
Although has been studied from a mathematical perspective [?], and provides a popular expansion in Bayesian inference and machine learning [?], it still lacks a probabilistic motivation. Recalling [?], computational methods characterized by a constructive probabilistic representation, such as the expectation-maximization algorithm, and the classical mean-field variational Bayes, have a direct justification, clear interpretation and fall within a general methodological framework provided with optimality properties. As discussed in Section 2, the bound proposed by [?], although apparently supported by purely mathematical arguments, has indeed a clear probabilistic interpretation.
2Probabilistic justification of via Pólya-gamma
This Section discusses the theoretical connection between and a recent Pólya-gamma data augmentation for conjugate inference in Bayesian logistic regression [?]. As shown in Theorem ?, these two representations are closely related, thereby providing a novel probabilistic interpretation of .
The proof of Theorem ? exploits results from the Pólya-gamma data augmentation [?]. In particular, the core contribution of [?] is in showing that the logistic probability mass function can be obtained by marginalizing out the Pólya-gamma latent variable , in the joint distribution with characterizing the density function of . Their result follows easily after noticing that and .
The above results suggest a data augmentation scheme which facilitates the implementation of a Gibbs sampler for conjugate Bayesian inference under Gaussian priors for the parameters in . In fact, as shown in [?], the joint distribution has a Gaussian kernel in . [?] also proposed an expectation-maximization routine for maximum a posteriori estimation, discussing direct connections with the variational Bayes in [?]. Their findings are however limited to computational differences and similarities among the two methods. We instead provide a fully generative connection between the contribution by [?] and the one from [?].
To prove Theorem ?, first notice that and . Replacing such quantities in and , respectively, we obtain
To highlight in the above equation, note that, adapting results in [?], the quantity can easily be re-expressed as , with the expectation taken with respect to a Pólya-gamma . Hence
Based on the above expression, the proof is concluded after noticing that the quantities at the numerator and the denominator denote the density functions and of the Pólya-gamma variables and , respectively [?]. Therefore, .
Note that, based on the Pólya-gamma data augmentation scheme, the joint density factorizes as [?], thereby allowing to be interpreted as the Kullback–Leibler divergence between a generic Pólya-gamma variable, and the one obtained by conditioning on the observed data . According to Corollary ?, this result allows to be interpreted as the evidence lower bound of under a Pólya-gamma data augmentation.
The proof of Corollary ? follows easily after applying the well-known equation . In fact, based on results in Theorem ?, .
Theorem ? and Corollary ? provide a probabilistic interpretation for the convex duality argument underlying the lower bound developed by [?]. Indeed, according to the above results, their expansion is a proper evidence lower bound for the logistic log-likelihood, induced by a latent variable representation via Pólya-gamma augmented data, which provides a quadratic complete log-likelihood . Besides this result, Theorem ? additionally provides a formal characterization for the approximation error . Indeed, this quantity is the Kullback–Leibler divergence between a generic Pólya-gamma random variable, and the one obtained by conditioning on the observed data. This allows to complete their inequality , obtaining
Equation has direct connections with maximum likelihood estimation via expectation-maximization algorithms, and Bayesian variational inference [?]. Leveraging this result, Section 3 shows that the maximum likelihood estimation routine proposed by [?] coincides with an expectation-maximization algorithm via Pólya-gamma augmented data, which has optimal properties compared to current routines characterized by monotone log-likelihood sequences. Section 4 highlights, instead, that the variational Bayes in [?], although apparently local, is indeed a global method which minimizes the Kullback–Leibler divergence between the joint posterior distribution of the coefficients and the augmented data , under a mean-field variational approximation based on Pólya-gamma variables. This novel global algorithm is also analytically derived.
3Maximum likelihood estimation via expectation-maximization methods
Leveraging the quadratic approximation , [?] propose an iterative routine for maximum likelihood estimation that has a monotone log-likelihood sequence and simple maximization. In particular, letting denote the estimated value for the coefficients at step and simplifying the calculations in the Appendix C of [?], their routine first maximizes with respect to , obtaining for each , and then derive by maximizing . This last step is straightforward due to the quadratic form of the lower bound, thus providing
with a diagonal matrix with entries , and .
Analyzing the above two-step maximization routine in the light of equation , it can be easily noticed that , , leads to the solution minimizing the Kullback–Leibler divergence in , whereas the function
maximized with respect to is equal, up to an additive constant, to the expected value of the complete-data log-likelihood computed with respect to the conditional distribution of the Pólya-gamma data , for . Combining these results with the key rationale underlying the expectation-maximization algorithm [?], it easily follows that the maximum likelihood estimation routine proposed by [?] is indeed an expectation-maximization algorithm based on Pólya-gamma augmented data. According to Algorithm ?, this routines first computes the expectation of , and then maximizes it with respect to .
As already discussed in [?], the above maximization method guarantees a monotone log-likelihood sequence , thus ensuring a stable convergence. Indeed, leveraging equation , it can be easily noticed that the above routine is a genuine minorize-majorize algorithm [?], provided that for each , and . In fact, when . We shall notice that also [?] and [?] highlighted this relation with minorize-majorize algorithms under a mathematical argument, while discussing the sharpness of the bound in . Leveraging results in Section 2, we additionally show that the minorize-majorize algorithm based on is more efficient than the one in [?]. To our knowledge, this is the only tractable minorize-majorize alternative to [?].
To address the above goal, let us first re-write to allow a direct comparison with the solution
from [?], with and the identity matrix. Indeed, a similar mapping can be also derived by adding and subtracting inside in to obtain the solution
after noticing that each entry in can be written as
A closer inspection of and shows that the updating equations underlying the bounds in [?] and [?] coincide with the one from the Newton–Raphson, after replacing the Hessian , with in and in . Recalling the discussion in [?] and the results in Section 2, both matrices define a lower bound for the Hessian, thus guaranteeing that the solutions and provide a monotone sequence for the log-likelihood. In [?] the uniform bound follows after noticing that for every , whereas, according to equation , the adaptive bound induced by [?] is formally related to an exact data augmentation strategy, thus suggesting that may provide a more efficient updating scheme than . This claim is formalized in Proposition ? by comparing the convergence rates of the two algorithms. Refer to [?] for details regarding the definition and the computation of the convergence rate associated with a generic iterative routine.
To prove the claim in Proposition ?, note that can be easily computed as in [?], exploiting the fact that does not depend on in . It is instead not immediate to calculate via direct differentiation of , since contains more complex hyperbolic transformations of . However, exploiting the probabilistic findings discussed in Section 2, this issue can be easily circumvented leveraging the expectation-maximization interpretation of the routine in [?] via Pólya-gamma augmented data. Indeed, following [?], the rate matrix of an iterative routine based on expectation-maximization methods, is equal to , with denoting the expectation, taken with respect to the augmented data, of the complete-data information matrix . This quantity can be easily computed in our case, provided that the complete-data log-likelihood is equal, up to an additive constant, to the quadratic function of , which is also linear in the augmented Pólya-gamma data . Due to this, it is easy to show that , where is the diagonal matrix with entries .
Recalling the above discussion, the proof of Proposition ? requires comparing the maximum eigenvalues of and . To address this goal, let us first show that . In particular, since the matrices , and are positive definite, then is positive semidefinite if . Proving this inequality, coincides with showing that is positive semidefinite. Since the hermitian matrix is a standard quadratic form, this result holds if all the entries in the diagonal matrix are non-negative.
Letting , and re-writing the well-established inequality [?] as , it easily follows that , thus guaranteeing , and, as a direct consequence, . This result concludes the proof. In fact, implies that [?].
Proposition ? guarantees that the updating scheme uniformly improves the convergence rate of . To understand this result, note that high values of imply slow convergence. We shall however emphasize that the expectation-maximization routine underlying the quadratic approximation in [?] still preserves a linear convergence rate, provided that . Hence, and do not reach the quadratic convergence rate of the Newton–Raphson scheme, but have the benefit of guaranteeing monotone log-likelihood sequences, while coinciding with the Newton–Raphson when . In this respect, note that although is not defined for , the expectation of the Pólya-gamma is still available and coincides with , thus making , and all equal when . It is also important to highlight that although the minorize-majorize routine in [?] has slower convergence, the matrix in does not depend on , thus requiring inversion only once during the iterative procedure. This may lead to a relevant reduction in computational complexity, especially in high dimensional problems, compared to the updating in , which requires, instead, inversion of at each iteration.
4Conjugate variational inference via Pólya-gamma latent variables
Although the overarching focus of Section 3 has been on maximum likelihood estimation, it should be noticed that the original motivation underlying the lower bound proposed by [?] comes from approximate Bayesian inference. In this setting, the main goal is to approximate the posterior distribution of the model parameters with a suitable density function , which minimizes the Kullback–Leibler divergence . Recalling, for example, [?] and [?] this minimization problem can be alternatively addressed by maximizing the evidence lower bound of the log-marginal probability mass function with respect to the entire , provided that
with . As discussed in [?], when is assigned a Gaussian prior , the maximization of with respect to is not straightforward due to the absence of conjugacy between the logistic likelihood of the observed data , and the Gaussian prior for . To address this issue, the authors replace the logistic likelihood in with the quadratic bound in , thus obtaining a joint distribution which is proportional to a Gaussian kernel in for every . Hence, given , the variational distribution maximizing is easily available as the multivariate Gaussian whose kernel is proportional to . Refer to equations (10) and (11) in [?] for the solutions of this optimization problem. The maximization of with respect to the variational parameters in proceeds instead via an expectation-maximization step exploiting the fact that, given the current estimate of , the lower bound is equal, up to an additive constant, to the expectation of a quadratic complete log-likelihood, with acting as augmented data. This approach provides the simple solution , for each , presented in equation (12) of [?].
Although replacing each in with the quadratic bound in provides major improvements in computational tractability, this operation partially breaks the fundamental equality supporting the variational argument. In fact, substituting with in equation , we obtain only the lower bound relationship , which makes it less straightforward to understand how the maximization of formally relates to a minimization of . Indeed, [?], classifies the variational Bayes in [?] as a local method which does not attempt an approximation to the full posterior over all variables, but relies on further local bounds over subsets of parameters, whose optimization does not match exactly the global evidence lower bound. For a similar reason, [?] place the routine in [?] among the variational methods beyond conjugacy. In these settings the evidence lower bound is typically intractable and coordinate ascent variational inference [?] is not straightforward.
Leveraging the probabilistic interpretation of , formalized in Theorem ?, Corollary ? proves that the variational Bayes in [?] is indeed a global methodology based on conjugacy results.
The proof is a direct consequence of Corollary ?. In particular, based on the above discussion, replacing with in , provides the lower bound
Based on the results in Corollary ? of Theorem ?, every lower bound term in the above equation can be replaced with . Moreover, note that the first summand in the above equation does not depend on , thus allowing us to replace the first integral with . Making these substitutions in the above equation, and using the linearity property of integrals, we obtain
thus proving Corollary ?. Note that and .
Corollary ? provides a formal justification for the variational Bayes in [?], under a global mean-field perspective. Indeed, completing the original inequality underlying [?], in the light of the results in Corollary ?, we have that . Hence, the maximization of coincides with minimizing the Kullback–Liebler divergence between the posterior distribution of the random quantities and its mean-field approximation . Hence, although apparently local, the routine in [?] falls indeed within the general mean-field variational framework, thus motivating new coordinate ascent variational inference routines for Bayesian logistic regression, derived in Algorithm ?. Refer to [?] and [?] for details about coordinate ascent variational inference.
According to Algorithm ?, all the steps are based on conjugacy results, while providing the same solutions to those obtained by [?] under a different perspective. Hence, as previously discussed, this strategy falls within the general mean-field variational framework with conjugacy, thus allowing several recent computational advances in this field [?] to be easily adapted to the routine in [?] under our probabilistic interpretation.