Multivariate Bayesian Logistic Regression for Analysis of Clinical Study Safety Issues\thanksrefT1

# Multivariate Bayesian Logistic Regression for Analysis of Clinical Study Safety Issues

## Abstract

This paper describes a method for a model-based analysis of clinical safety data called multivariate Bayesian logistic regression (MBLR). Parallel logistic regression models are fit to a set of medically related issues, or response variables, and MBLR allows information from the different issues to “borrow strength” from each other. The method is especially suited to sparse response data, as often occurs when fine-grained adverse events are collected from subjects in studies sized more for efficacy than for safety investigations. A combined analysis of data from multiple studies can be performed and the method enables a search for vulnerable subgroups based on the covariates in the regression model. An example involving 10 medically related issues from a pool of 8 studies is presented, as well as simulations showing distributional properties of the method.

\kwd
\volume

27 \issue3 2012 \firstpage319 \lastpage339 \doi10.1214/11-STS381

\relateddois

T1Discussed in \relateddoid10.1214/12-STS381A, \relateddoid10.1214/12-STS381B and \relateddoid10.1214/12-STS381C; rejoinder at \relateddoir10.1214/12-STS381REJ. \runtitleMBLR for Clinical Safety Data

{aug}

a]\fnmsWilliam \snmDuMouchel\correflabel=e1]bill.dumouchel@oracle.com

Adverse drug reactions \kwdBayesian shrinkage \kwddrug safety \kwddata granularity \kwdhierarchical Bayesian model \kwdparallel logistic regressions \kwdsparse data \kwdvariance component estimation.

## 1 Introduction

This paper introduces an analysis method for safety data from a pool of clinical studies called multivariate Bayesian logistic regression analysis (MBLR). The dependent or response variables in the MBLR are defined at the subject level, that is, for each subject the response is either 0 or 1 for each safety issue, depending on whether that subject has been determined to be affected by that issue based on the data available at the time of the analysis. Safety issues can include occurrence of specific adverse events as well as clinically significant lab tests or other safety-related measurements. The predictor variables, assumed to be dichotomous or categorical, are all assumed to be observable at the time of subject randomization. The analysis is cross-sectional rather than longitudinal, and does not take into account the variability, if any, of the length of time different subjects have been observed. The primary predictor is study Arm, assumed to be dichotomous with values “Treatment” or “Comparator.” Other subject-level covariates may be included, such as gender or age categories, or medical history variables. One feature of the MBLR approach is that the interactions of treatment arm with each of the other covariates are automatically included in the analysis model. Data from a pool of multiple studies (having common treatment arm definitions) may be included in the same analysis, in which case the study identifier would be considered a subject covariate. Analyses involving a pool of studies are similar in spirit to a full-data meta-analysis.

Estimation of effects involves a hierarchical Bayesian algorithm as described below. There are two primary rationales for the Bayesian approach. First, data concerning safety issues are often sparse, leading to high variability in relative rates of rare events among subject subgroups, and the smoothing inherent in empirical Bayes shrinkage estimates can alleviate problems with estimation of ratios of small rates and the use of multiple post-hoc comparisons when encountering unexpected effects. Second,MBLR fits the same analytical model to each response variable and then allows the estimates of effects for the different responses to “borrow strength” from each other, to the extent that the patterns of coefficient estimates across different responses are similar. This implies that the different safety issues should be medically related, so that it is plausible that the different issues have related mechanisms of causation or are different expressions of a broad syndrome, such as being involved in the same body system, or different MedDRA terms in nearby locations in the MedDRA hierarchy of adverse event definitions. The goal is to assist with the problem of uncertain granularity of analysis. The question of how to classify and group adverse drug reaction reports can be controversial because different assignments can change the statistical significance of count data treatment effects, and methods and definitions for comparing adverse drug event rates are not well standardized (Dean, 2003). Sometimes the amount of data available for each of the related safety issues is too little for reliable comparisons, whereas doing a single analysis on a transformed response, defined as present when any of the original issues are present, risks submerging a few potentially significant issues among others having no treatment association. The Bayesian approach is a compromise between these two extremes.

The proposed methodology is not intended to replace or replicate other processes for evaluating safety risk but rather to support and augment them. In spite of the formal modeling structure, its spirit is more a mixture of exploratory and confirmatory analysis, a way to get a big picture review when there are very many parameters of interest. The resulting estimates with confidence intervals can provide a new approach to the problem of how to best evaluate safety risk from clinical studies designed to test efficacy.

This paper describes the statistical model and the estimation algorithms used in a commercial implementation of MBLR. There is also some discussion of alternate models and algorithms with reasons for our choices. An example analysis utilizes data from a set of clinical studies generously provided by an industry partner, and a simulation provides information on the statistical properties of the method.

## 2 The Bayesian Model for MBLR

As with standard logistic regression, MBLR produces parameter estimates interpretable as log odds, and provides upper and lower confidence bounds for these estimates. The method is based on the hierarchical Bayesian model described below. Identical regression models (i.e., the same predictor variables for different response variables) are estimated assuming that the relationships being examined are all based on the same underlying process. The response variables represent issues comprising a potentially common safety problem and the underlying process is an adverse reaction caused by the treatment compound. The regression models are various examinations of relationships between subgroups defined by the covariates and the response issues. The Bayesian estimates of treatment-by-covariate interactions are conservative (estimates are “shrunk” toward null hypothesis values), in order to reduce the false alarm due to high variance in small sample sizes. This conservatism is a form of adjustment for multiple comparisons.

It is natural to desire a comparison of MBLR with a more standard analysis, which, for the present purpose, means a logistic regression model where the estimates for the different responses are not shrunk toward each other, and where interactions between treatment and other covariates are not being estimated. However, it can often happen, with sparse safety data involving rare adverse events and the use of other predictors in addition to the treatment effect, that standard logistic regression estimation can fail, because the likelihood function has no unique finite maximizing set of parameters. Gelman et al. (2008) discuss this problem, caused by what they call separation and sparsity, and suggest the automatic use of weakly informative prior distributions as a default choice for such analyses. Along the lines of the Gelman et al. (2008) suggestion, we will compare MBLR to a “weak Bayes” method that corresponds to setting certain variance components (that are estimated by MBLR) to values selected to be so large that the resulting estimates would be virtually the same as those of standard logistic regression if the data are not so sparse as to be unidentifiable. This comparison method will be denoted regularized logistic regression (RLR).

The event data can be considered as a -column matrix , with a row for each subject and a column for each issue, and where if subject  experienced issue , otherwise. Since all subject covariates are assumed categorical, we will use a grouped-data approach, where there are subjects () that have identical covariates and treatment allocation in the th group, and where of these subjects experienced issue .

MBLR requires the inclusion of the treatment arm and of one or more additional predictors in the model, where all predictors are categorical.

Across the set of issues a single regression model is used. If there are predictors excluding Treatment, and the th predictor has categories, , then there will be subgroups analyzed. The model will usually have degrees of freedom and estimation is performed by constraining sums of coefficients involving the same covariate to add to 0. The Bayesian methods presented here allow estimation in the presence of additional collinearity of predictors, in which case the computed posterior standard deviations would then reflect the uncertainty inherent in a deficient design.

For the th group of subjects, the modeled probability of experiencing issue is , where

 Pik = 1/[1+exp(−Zik)]and where (1) Zik = α0k+∑1≤g≤GXigαgk (2) +Ti(β0k+∑1≤g≤GXigβgk).

The columns of define the dummy variables for the covariates, and is an indicator for the treatment status of the th group. The values of (; ) define the risk of issue for the comparator subjects. As mentioned above, the sums , where the sums are over the categories of each covariate for each . The more natural quantities () are the log odds that a comparator subject in subgroup will experience issue , , averaged across the categories of other predictors not defined by subgroup .

Concerning treatment effects, the quantities () are the estimated log odds ratios for the risk of issue (treatment versus comparator) that subjects in group experience, , averaged across the categories of other predictors not defined by subgroup . The sums are constrained just as the ’s were.

If is large, there will be many possible subgroup comparisons, and, since these confidence intervals have not been adjusted for multiple comparisons, caution is advised in interpreting the largest few of such observed subgroup estimates. The MBLR estimates of these quantities are designed to be more reliable in the presence of multiple comparisons because subgroup-by-treatment interaction effects are “shrunk” toward 0 in a statistically appropriate way, and there is also a partial averaging across issues, so that subgroup and treatment effects and subgroup-by-treatment interactions can “borrow strength” if there is an observed similar pattern of treatment and subgroup effects in most of the issues being analyzed. When configuring a multivariate Bayesian logistic regression, the analyst should try to select those issues for which there is some suspicion of a common medical mechanism involved. If the Bayesian algorithm does not detect a common pattern of subgroup effects, then the Bayesian algorithm will perform little partial averaging across issues, because corresponding variance component estimates will be large.

The Bayesian model is a two-stage hierarchical prior specification:

 αgk|Ag ∼ N(Ag,σ2A), (3) \eqntextk=1,…,K;g=1,…,G, (4) β0k|B0 ∼ N(B0,σ20),k=1,…,K, (5) βgk|Bg ∼ N(Bg,σ2B), (6) \eqntextk=1,…,K;g=1,…,G, (7) Bg ∼ N(0,τ2),g=1,…,G. (8)

The prior distributions of , , and of , , and of are assumed uniform within (). Equations (3)–(6) embody the assumption that coefficients for the same predictor across multiple issues cluster around the predictor-specific values () with the degree of clustering dependent on the magnitude of three variances (). If any of these variances are near 0, there will be a tight cluster of the corresponding regression coefficients across the responses, whereas if they are large, there may be no noticeable common pattern across for predictor . The values of correspond to the constant terms in the regressions, and we assume no common shrinkage of constant terms across issues, since the absolute frequencies of the issues are not being modeled here.

Equation (8) embodies the assumption that the null hypotheses (i.e., no treatment-by-covariate interactions when averaged across responses) are given priority in the analyses. This is the assumption that helps protect against the multiple comparisons fallacy when searching for vulnerable covariate subgroups. The value of determines how strongly to shrink the prior means toward 0 in the second level of the prior specification.

The four standard deviations ( have prior distributions assumed to be uniform in the four-dimensional cube , . Their joint posterior distribution is approximated by a discrete distribution for computational convenience, as described below. The posterior distribution of the coefficients is defined as a mixture of the distributions of the coefficients conditional on the possible values of the variance components. The method produces an approximate variance–covariance matrix for all the coefficients, and this also allows the estimation of standard deviations and confidence intervals (credible intervals) for linear combinations of parameters such as the quantities () describing the total treatment effect estimates for each subgroup.

General discussion of hierarchical Bayesian regression models is available in Carlin and Louis (2000), although the particular model (involving multiple responses) and estimation methods used in this paper are not discussed there. Searle, Casella and McCulloch [(1992), Chapter 9] also discuss related methods, including a logit-normal model somewhat similar to this one.

## 3 Estimation Details

### Estimation of MBLR Parameters

The estimation algorithm for MBLR is based on separate maximizations of the posterior distributions of the coefficients, conditional on the values of the variance components. Then these posterior distributions are averaged to provide an overall posterior distribution, where the weights in the average are determined by the Bayes factors for different values of the vector of the four variance components. First we assume that the four standard deviations () are fixed and known and consider estimation of the other parameters.

### Estimation of Coefficients and Prior Means Conditional on Prior Standard Deviations

There are such parameters: prior means, values and values of . However, sums of these parameters are defined as 0, leaving dimensions for estimation. It is convenient to imagine that subjects are grouped according to unique values of their covariates and treatment allocation, so that the data are the sample sizes and the counts (; ), where indexes strata defined by unique values of covariates and treatment. The joint distribution of the parameters and the data can be represented as

 p(A1,…,AG,B0,…,BG) ⋅∏kp(α0k,…,αGk,β0k,…,βGk|{A}{B}) (9) ⋅∏k⋅p({Nik}|{A}{B}{α}{β}).

The prior distributions of , and the are assumed uniform over (), whereas all the remaining parameters have prior distributions as given in equations (3)–(8).

Therefore, if is the log posterior joint distribution of all the parameters, then, up to a constant,

 2logL = −[∑g>0B2g/τ2+(G−J)log(τ2)] (10) −[∑g>0∑k(αgk−Ag)2/σ2A +(G−J)Klog(σ2A)] −[∑k(β0k−B0)2/σ20+Klog(σ20)] −[∑g>0∑k(βgk−Bg)2/σ2B +(G−J)Klog(σ2B)] +2∑i∑k[Niklog(Pik) +2∑i∑k[+(ni−Nik)log(1−Pik)].

In (10), the terms involving , and all have a factor (), rather than the more natural , since there are values and . But since they are being estimated subject to constraints where subsets of them add to , the factor () is substituted, analogous to the way REML estimates are defined for variance components in a frequentist analysis. For fixed variance components, maximization of (10) with respect to all other parameters, remembering that the are defined by (1) and (2), involves a relatively straightforward modification of the usual logistic regression calculations. The prior means and are treated analogously to the coefficients and during the Newton–Raphson maximization of . Each iteration involves calculation of the vector of first derivatives of with respect to the parameters in (10) and the Hessian matrix of the negative second derivatives of . The initial values of are , (the subscript “” means sum over the values of ), whereas the initial values of all other parameters are 0.

Upon convergence of the maximization, the variance–covariance matrix of the estimated parameters is assumed to be

 V=V(σA,σ0,σB,τ)=H−1. (11)

[Actually, the matrix will be singular because of the constraints that reduce the rank of . The interpretation of (11) is as follows. Define a subset of parameters out of the -vector , where one parameter from each constrained subset has been omitted, but will be constrained to be equal to the negative of the sum of the other parameters in its subset. Define the matrix that converts from to , that is, . Then (11) is interpreted as . The same transformation is used during the Newton–Raphson maximization of . Also, in (15) and later, the determinant of is computed as the determinant of .]

The computation of as uses the assumption that the counts are independent across both and , conditional on the parameters. The occurrence of different events in the same subject may be connected via the parameters, but not otherwise correlated in this model. If this assumption is violated, the variances in may be underestimated. Since the parameters include both all the coefficients as well as their prior means, the variances in for any one component automatically include uncertainty due to correlation with all other components. In particular, uncertainty in the prior means is taken account of in the estimated posterior variances of the (up to the accuracy of the approximate multivariate normality of the joint posterior distribution of the parameters).

### Accounting for Uncertainty in the Prior Standard Deviations

The prior distribution of the set of possible values of () is assumed to be uniform within the four-dimensional cube with limits (), where a default value of is selected as discussed below. A discrete search method approximates the posterior distribution within this cube. Before discussing the details, consider the situation where the prior standard deviation vector is assumed to be one of discrete values . Denote the vector of coefficients and prior means by , and assume that the maximized and the estimated posterior mean and covariance matrix of  are if , . Then the marginal posterior distribution of , adjusting for uncertainty in , is assumed to be multivariate normal with mean and covariance matrix , where{subequation}

 ^θ = ∑sπsθs, (12) V = ∑sπs[Vs+(θs−^θ)(θs−^θ)t], (13)

and where , the posterior weight given to , , is defined by {subequation}

 πs = BFs/(BF1+⋯+BFS), (14) BFs = exp(logLs)√det(Vs). (15)

The quantity is the (relative) Bayes factor for the hypothesis . The usual definition of the Bayes factor requires the integration of the joint likelihood over the space of all parameters not specified by the hypothesis—in this case the space of all . Using the approximation of this likelihood as proportional to a multivariate normal density with covariance matrix , and the known fact that volume under the multivariate exponential form is proportional to the square root of the determinant of , the definition of is as given in (3). The approximation (3) is the standard Laplace approximation often used for numerical integration in Bayesian methods. However, a different justification for computing (15) in order to obtain estimates for variance components is given by the theory of h-likelihood (Lee and Nelder, 1996; Lee, Nelder and Pawitan, 2006; Meng, 2009).

### Selection and computation of the values (ϕs,πs), s=1,…,S

Representing the 4-dimensional naturally continuous distribution of by a set of discrete points is a challenge. Assuming a range of for each element of and a spacing of 0.1 would mean a grid of  points, the vast majority ofwhich would have values of nearly 0. Determination of a set of just points to represent the approximate posterior distribution of is performed as outlined next. A logistic transformation is used to convert the bounded cube to the unbounded region where all four elements can range from () by defining

 λ = (λA,λ0,λB,λτ)where σA = d/(1+e−λA),σ0=d/(1+e−λ0), (16) σB = d/(1+e−λB),τ=d/(1+e−λτ).

With this transformation, a uniform prior distribution on () for each corresponds to a prior distribution for each over the real line of . The purpose of this transform is to allow simpler search procedures that don’t have to worry about boundary constraints, as well as to make approximation of the posterior by a multivariate normal distribution more accurate. Then the posterior density of is assumed to be

 g(λ) = g(λA,λ0,λB,λτ) (17) ∝ f(λA)f(λ0)f(λB)f(λτ) ⋅exp(logLs)√det(Vs),

where and in (17) are now functions of , and the ’s vary over ().

The determination of the discrete distribution (, , is a five-step process:

Step 1: Use the method of steepest ascent to find the value that maximizes in (13). Derivatives of g are computed numerically as first difference ratios with respect to each of the four arguments. The starting value for the search is .

Step 2: Construct a design of -values by adding 16 points on the surface of each of two concentric spheres centered at . The points on the inner sphere consist of 8 star points, where one component of is and the other three components equal , and 8 half-fractional factorial points, where all components are . The points on the outer sphere are similar to those on the inner sphere, except that is replaced by and the fractional factorial points are from the opposite half fraction as the fractional factorial points on the inner sphere. The default value of on the scale of . Visualizing the geometry of the design, if a 4-dimensional sphere has radius 1.5 times another, it encloses about 5 times the volume.

Step 3: The double central composite design of Step 2 is centered but not scaled to the actual distribution . To find the appropriate scale factors in each dimension, , for a better fitting design, a quadratic response surface model is fit to values of across the points of this initial design. The fitted model is

 logg(λ)=c0+∑iciλi+∑i≤jcijλiλj. (18)

Now if the quadratic model fit exactly (i.e., if were exactly multivariate normal), then the second-order coefficients would specify the elements of the inverse of the posterior covariance matrix of . Accordingly, we get what are hoped to be approximate posterior standard deviations by setting of square roots of the diagonal of , where

 2H=⎡⎢ ⎢ ⎢⎣2c11c12c13c14c122c22c23c24c13c232c33c34c14c24c342c44⎤⎥ ⎥ ⎥⎦. (19)

Step 4: Next a new design like that of Step 2 is constructed except that the used in Step 2 for all 4 dimensions is replaced by from Step 3, so that the spheres are scaled differently in each dimension. The values of are computed for these 32 new points and a new quadratic response surface is fit to this 33-point final design. Let the peak of this fitted surface be denoted , which will not exactly equal , and redefine by using the coefficients from the new quadratic response surface in (19).

Step 5: The discrete distribution defined by as computed in Step 4 will roughly approximate the continuous distribution defined by , but the approximation can be improved by modifying the probabilities to constrain the 4 means and 4 standard deviations of the discrete distribution to exactly match the values and that were computed from the response surface fit of Step 4. The final probabilities , , are computed as the solution to the following constrained optimization problem:

Find positive that minimize the Kullback–Leibler divergence

 KL=∑sg(λ(s))log[g(λ(s))/πs], subject to the 9-dimensional constraints (20) ∑sπs=1;∑sπsλ(s)=λfit; Missing or unrecognized delimiter for \bigr

where the last two equations are each interpreted as 4 constraints, one for each component of . The constrained minimization problem of (3) is solved using the method of Lagrange multipliers combined with a Newton–Raphson solution of the resulting 9 equations.

Thus, the used in (3) are the solution to (3) rather than the more direct values in (3). They differ from (3) by incorporating the Jacobian terms of (17) and the further modifications needed to satisfy the constraints in (3). The values of used in (3) are the back-transformations defined by (3) of the final points used in Steps 4 and 5.

### Estimates Using Regularized Logistic Regression (RLR)

To compare the MBLR results to standard logistic regression, and still be able to avoid problems with nonidentifiability, as discussed above, the RLR algorithm is defined by fitting MBLR under the constraints

 σA=5,σ0=5,σB=0.001,τ=0.001. (21)

Setting and very close to effectively constrains the estimates of covariate-by-treatment interactions to be 0. Setting and to be very large prevents the estimates across different response events from shrinking toward each other The rationale for thinking that a prior standard deviation of 5 is very large for a logistic regression coefficient is as follows. Remembering that the coefficients are interpreted as logs of odds ratios, an increase of 5 in a coefficient corresponds to a multiplicative factor of in an odds ratio. With respect to the assumed normal prior distributions in equations (3)–(8), the prior standard deviation of 5 implies that about one-third of all estimated odds ratios are expected to be outside the range of (, 148). This certainly seems to be well beyond the range of expected odds ratios in any medical risk estimation situation. See Gelman et al. (2008) for a related discussion. [In the Bayesian setup described above, we use as default limits for the prior standard deviations (, ). Considering a prior standard deviation to be as large as 1.5, where , implies that about one-third of the estimated odds ratios would be outside the range of (, 4.5), which seems a bit of a stretch, but barely conceivable.]

Using the values in (21) for the prior standard deviations, this alternative weak Bayesian prior method estimates the parameters and their variances using the iterative Newton–Raphson estimation described above. The resulting estimates are computationally reliable even if many of the response events are sparse. Such estimates perform very little shrinkage across response models because the prior standard deviations in equations (3)–(8) are large compared to the standard errors of the (estimable) logistic regression coefficients. However, the MBLR and RLR models as formulated will not protect against problems of estimability in case every response is quite sparse, because of the use of an improper prior for the prior means . If certain covariate or treatment categories are perfectly correlated with every response, then one must either drop such predictors or add additional response variables.

The Bayes factor for can be computed and compared to the 33 values found in the final grid of the Bayesian estimation described above, which provides further evidence regarding the prior standard deviations. In particular, large Bayes factors against imply that the MBLR model fits the data better than the RLR model, meaning that there is significant evidence that either the responses have similar covariate profiles or that there are significant covariate-by-treatment interactions.

### Confidence Intervals for Odds Ratios

Let the final estimate of, for example, be , so that the odds ratio point estimate is . Using the normal approximation to the posterior distribution of the coefficients and the estimates of in equation (3), 90% confidence intervals (posterior credible intervals) for the corresponding odds ratios are given by

 OR.05 = exp[bgk−1.645√v(βgk)]

For the main effects of covariates or for treatment, these provide confidence (credible) intervals for odds ratios of the predictor vs the response outcome. The odds ratio comparing two categories of a multicategory covariate would be found by taking the ratio of the corresponding exponentiated coefficients.

Interpreting the interaction effects of covariates with treatment arm is tricky, since it would involve ratios of odds ratios. To aid in interpretation, one can present in addition to the interaction coefficients themselves, the sums of the treatment coefficient plus the interaction coefficients. Confidence intervals for these sums are formed in the usual way, taking into account the covariances between the treatment coefficient and the interaction coefficients. Whenthese sums and their confidence limits are exponentiated, we get estimates and limits for subgroup treatment-by-outcome odds ratios. These estimates are oriented toward finding potentially vulnerable subgroups where the adverse effect risk of treatment is especially high.

## 4 Discussion of Methods and Alternate Models

The philosophy of estimation is not to try to model the medical mechanisms perfectly, but to provide a reliable compromise between pooling related sparse events in order to increase the sample size, and fitting separate models to each event, with the corresponding loss of power due to small samples. The selection of which issues to include in an MBLR is important. There needs to be at least a superficial plausibility that all or many of the selected outcome issues might have similar odds ratios with treatment and with the covariates in the model, what Bayesians call exchangeability. Sometimes it may be difficult to decide what other issues to include if attention has focused primarily on a single and seemingly unique issue such as subject death. Because it takes several degrees of freedom to estimate a variance component, the values of some of the standard deviations in equations (3)–(8) may be poorly estimated if and/or are not large, but the use of Bayes factors and the computation of the in (3) and (3) allow some assessment and adjustment for this uncertainty.

The current model is quite similar in spirit to, and somewhat inspired by, that proposed by Berry and Berry (2004). They also assume that drug adverse reactions are classified into similar medical groupings in order to use a shrinkage model to allow borrowing strength across similar medical events. They focus on treatment/comparator odds ratios only and do not consider covariates or the use of logistic regression. They also define a more complex model having many more variance components than the one proposed here.

One might ask the question of why estimate covariate effects at all, since in a randomized study the covariates should all be nearly orthogonal to the treatment variable? The rationale in MBLR is not so much to adjust for potential biases in the treatment main effect, but to be able to include treatment-by-covariate interactions in order to detect possibly vulnerable subgroups that might react differently to the treatment. When is large (many covariate categories) it will often be difficult to estimate so many parameters unless all the issues being modeled occur frequently. The multiple comparisons involved make any search for vulnerable subgroups difficult and subject to false alarms, especially for sparse events. This makes the use of Bayesian shrinkage of the interaction terms in (8) especially valuable: it negotiates the bias-variance trade-off among multiple event rates having possibly very different sampling variances. Without this smoothing effect, estimates of interactions affecting rare events will be so variable as to be useless, which is why the RLR method is defined to estimate only main effects.

The importance of avoiding undue rejection of the null hypothesis in the presence of multiple post-hoc comparisons is central to being properly conservative when evaluating treatment efficacy. There is a question as to how much this conservatism should extend to exploratory analyses of safety issues. For example, the prior specification (8) shrinks the interaction prior means toward 0, whereas the main effect prior means and are not shrunk toward 0. We prefer to maintain maximum sensitivity to safety main effects, while accepting that true interaction effects are less likely and need more false alarm protection. We also encourage parallel computation of the minimal-shrinkage regularized LR estimates discussed above, so that the analyst can perform an easy comparison and sensitivity analysis of the effects of shrinkage.

The prior distributions in equations (3)–(8) are all assumed to be normal distributions. Many Bayesian researchers have pointed out that since normal distributions generate few outliers, outliers may be correspondingly suppressed under this assumption.Commonly suggested alternative prior distributions are the double exponential and Student’s , which tend to shrink outliers less. The double exponential(“lasso”) prior has nonstandard theoretical properties that make computation of standard errors of coefficients problematical, and so have been ruled out for this application. Alternative distributions like Student’s  are difficult to handle computationally in our complex situation where there are hundreds of coefficients and multiple variance components. The normal model that we use has a concave log posterior density function and the iterative estimation algorithm is guaranteed to converge.

There is a similar computational feasibility rationale for using the discrete approximation to thedistribution of prior standard deviations. It is more common in the recent Bayesian literature to use Gibbs sampling or another Markov chain MonteCarlo (MCMC) method to estimate the posterior distributions of all parameters. Two reasons for preferring to avoid such methods are as follows: first, we want to allow scientists without much statistical sophistication, much less experience with fancy Bayesian computational methods, to use MBLR and these users would have trouble assessing convergence of such high-dimensional MCMC runs. Second, these users might also be uncomfortable with the fact that repeating an analysis on the same data typically leads to slightly, but noticeably, different answers. The method for handling the variance component estimation outlined above provides computationally and statistically reliable answers within a feasible computational burden. As described above, there are three roughly equally expensive stages in the model fitting computations: the two preparatorystages of finding the maximum of the posterior distribution and then evaluating it on an initial grid to find scale parameters in each direction, and the last stage of evaluating the model on the final grid to approximate the posterior distribution of the variance components.

## 5 Example Analysis

### Data Description

The data used for the example analyses are from a pool of eight studies, kindly contributed by an anonymous partner. Four of the studies were for one indication and four were for a second indication. There were a total of 5752 subjects in the pooled studies, 3110 in the Treatment arm and 2642 in the Comparator arm.

Display 1 shows statistics from these studies for a set of ten issues related to dehydration and/or renal function. All ten issues show up with greater frequency in the treatment arm than in the comparator. The final two columns are the endpoints of 95% confidence intervals for the odds ratios comparing treatment and comparator groups in the pooled data, computed using a normal approximation for the log (odds ratio) after adding 0.5 to every cell of each table. It is clear that many of these issues are associated with treatment, and we wish to investigate the commonality of these medically related issues, as well as the possibility that certain subgroups of subjects may be more or less affected by these associations.

Display 2 shows the four covariates selected as grouping variables for this analysis: Gender, Study ID, Renal History and Age. Recall that of the 8 studies being pooled, there were 4 studies for each of two potential indications for the drug. The Study ID values of A1–A4 and B1–B4 distinguish the studies for indication A and indication B. The Renal History variable distinguishes those subjects whose medical history (before randomization) includes one or more renal problems. As can be seen from Display 2, there are many more male than female subjects, and the age range 51 to 75 predominates. Three of the studies for indication A had about a 3:1 split of Treatment to Comparator subject counts, while the other five studies are more equally split. Study B2 had only 28 subjects total, while Study B4 had 4133 subjects, over two-thirds of the total in the pool. Only about 7% of the subjects had a previous history of renal problems.

Display 1 shows that five of the ten issues affected fewer than 10 Comparator-group subjects, whereas there are 16 separate covariate groups in Display 2. This makes it unlikely that those rare issues would occur in every treatment–covariate combination, which is necessary for convergence of a standard LR where the model includes all treatment–covariate interaction terms. In fact, only 3 of the 10 issues satisfy this condition, confirming the necessity of some special technique such as MBLR to try to estimate treatment-by-covariate interactions, and, in fact, even a main-effects only model would not be estimable by standard logistic regression applied to the rarer of these response issues, making the regularized LR necessary for this example.

### Posterior Distributions for Prior Standard Deviations

This example has , and , with the total number of parameters (elements of ) to estimate being , with degrees of freedom. Display 3 shows various results as a function of the four prior standard deviations. The top row 0 describes the regularized LR case where , , , . The rows labeled 1–33 in Display 3 show results for the final grid used to approximate the posterior distribution of . The row 1 values are the maximum posterior estimates (transformed from the scale of to that of ) estimated by the final response surface fit described above. Rows 2–33 show the remaining values of the final stage grid. In this example, all stages of the estimation required a total of about 400 iterations through the data, that is, about 400 evaluations of (10) and its first and second derivatives with respect to parameters.

The rightmost column in Display 3, headed “PROB,” shows the values of , as defined by (3). As discussed above, these probabilities have been adjusted so that the discrete distribution of matches the means and variances of the continuous distribution of as estimated by the response surface fit to the values of .

The bottom two rows of Display 3 show the posterior mean and standard deviations of the components of using this 33-point discrete approximation. It can be seen that the values are approximately (, , , ). The value in the row marked “Mean” and the column marked “PROB” is computed as , which is a measure of the dispersion of the probabilities . The smaller it is, the more spread out are the probabilities among the 33 grid points. Large values of , say, values above 0.2, would imply that the scale or location of the grid might be poorly chosen, so that only a few points on the grid are very probable.

### Comparison of MBLR and RLR Estimates of Treatment Effects

Display 4 shows estimation results for the treatment main effects for each of the two methods and for each response event and for the prior mean of all responses. The prior mean odds ratio is defined as , whereas the treatment odds ratio for the th response is . For each combination the odds ratio and its approximate 90% confidence (credible) interval are shown, based on (22). Comparing the MBLR to the RLR estimates, we see that the MBLR estimates are pulled away from the RLR estimates and “shrunk” toward the MBLR priormean, which represents the average or typical odds ratio across response issues. The degree of shrinkage is greatest for the highest-variance RLR estimates, corresponding to the rare issues such as Anuria and Urine output increased. For these two issues, although the MBLR odds ratio estimate is smaller than the corresponding RLR odds ratio, but so are their posterior variances, so that the lower bounds of the MBLR intervals are greater, providing greater statistical significance from the null hypothesis of . Even though all 8 occurrences of Anuria were in the treatment arm, the treatment effect does not show up as significant with the multiple-predictor RLR model—the MBLR estimate of the effect on Anuria seems more reasonable.

Inspection of Display 4 shows that not all of the MBLR confidence intervals are narrower than the corresponding RLR interval. The reverse is true for the more frequent responses such as Hyperkalaemia and Thirst. In these cases, the MBLR estimates do not benefit much from the relatively weak prior distribution, and their posterior variances are adversely impacted by the uncertainty in the variance component estimation as well as the need to estimate all of the interaction parameters, which are assumed away by the RLR model.

### MBLR Estimates of Prior Means

Display 5 graphs the MBLR estimates of the (exponentiated) prior means , with their 90% CIs. These are interpreted as effects for a “typical” response variable. Remembering that coefficients for categories of each covariate must sum to 0, the corresponding odds ratios must average to 1 when plotted on a log scale. The middle interval shows the main effect of treatment, the intervals above show covariate main effects, and the intervals below show treatment interactions. As also shown in Display 4, the treatment effect prior mean is about 4.4 on the odds ratio scale, with 90% limits of (2.7, 7.1). The main effects of covariate estimates, shown above the treatment line, can be thought of as the effects of covariate categories within the comparator arm, and as centers of shrinkage across the responses. Thus, the rates of these events in the comparator arm are somewhat less for Age:50 and under and for Renal History:N. Also, Study:A2 had a particularly high event rate, while Study:B4 had a particularly low event rate. But none of these differences in groups based on covariates are as large as the treatment effect.

The lower set of estimates in Display 5 portray the treatment–covariate interactions. As can be seen, these effects are smaller than the main covariate effects and much smaller than the main treatment effect. The treatment effect estimates within the four studies for Indication A are all larger than the four estimates for the Indication B studies, but the uncertainty intervals all overlap considerably. Although this does not rule out larger interaction effects for some of the response variables, the fact that , is about 0.3 and both , and are each less than 0.2 means that such effects for individual responses are also likely to be fairly small. Since is about 0.76, there is more room for variation in treatment main effect among the responses, as we also saw in Display 4, where the Treatment odds ratios ranged from 1.3 for Hyperkalaemia to 7.4 for Polydipsia.

The prior means of the treatment by covariate interactions (the bottom 16 intervals of Display 5) have especially small posterior means, as might be expected given that they have been shrunk toward 0 because of the small value of , with posterior mean 0.196 in Display 3. Another way of saying this is that the estimates of were so small compared to their sampling variances that only a small value of is compatible with these results and the assumption of (8).

The estimates of prior means under the regularized LR model are less interesting. Assuming that  and are large implies that and cannot be estimated well and will thus have wide confidence intervals, and of course assuming no interactions means that the for .

### Breakdown of Estimates by Study for Issue Pollakiuria

Display 6 shows the MBLR 90% intervals for odds ratios relating to the Study ID covariate and the issue Pollakiuria (very frequent daytime urination). The table information in Display 1 shows that this was highly associated with treatment (193:34 split by treatment:comparator). Our discussion focuses on whether and how the results differ by the studies being pooled, and what summary conclusions are justified across studies. The goal is similar to meta-analysis, except that we have complete data from each study and so can adjust for more potentially biasing between-study differences.

The top eight intervals in Display 6 show the Study main effects, corresponding to relative differences among the comparator arm odds of reporting Pollakiura within the studies. These differential estimates are adjusted for the other covariates Age, Gender and RenalHistory. There are relatively large and significant study effects, especially betweenStudy A2 and Study B4, where the estimated odds ratio is over 8 (2.8 versus 0.34 on the horizontal axis), with relatively narrow 90% intervals.

The next set of eight intervals shows the Treatment by Study interaction estimates. Although the differences are not as large as in the comparator arms, the pattern is similar, in that the studies that had a large base rate of Pollakiuria tended to have larger increases in adjusted Pollakiura rates. The three studies having the largest treatment effects (A1, A2, A4) are all based on Indication A. These estimates are somewhat hard to interpret, being ratios of odds ratio estimates. The lower set of intervals return to the simple odds ratio scale by adding (on the log scale) the interaction estimates to the main effect of treatment. The very bottom interval shows the 90% interval for the treatment main effect, and the central points for the eight intervals above it average to the center point at the bottom.

These last 9 intervals in Display 6 are reminiscent of the way a meta-analysis is often presented in a “ladder plot,” with estimates of effect for each study, and followed by a combined treatment estimate at the bottom. However, there are certain differences due to the more complex MBLR model. First, as mentioned above, these estimates have been adjusted for differential covariate distributions across studies. Second, the Pollakiuria estimates here have been shrunk toward the prior mean estimates of the odds ratios involving all responses. Third, the shrinkage of interaction estimates toward 0, governed by in (8), is similar to the shrinkage toward a common mean effect that occurs in a random effects meta-analysis. Fourth, the weight that each study contributes to the overall estimate is governed by a more complex formula than in either the standard fixed or random effects meta-analyses. However, it does share with the random effects methodology the fact that relative weights are much attenuated compared to relative sample sizes. Finally, this more complex calculation means that the single-study treatment estimates in the above MBLR graph do not preserve the between-study differences, as might be shown in a standard meta-analysis presentation.

The response Pollakiuria was chosen as the example for Display 6 because that issue showed a greater Treatment-by-Study effect than other issues: for example, in Display 6 the Trt\tsup*Study:A1 effect is 1.33, while the Trt\tsup*Study:B4 effect is 0.79, for a ratio of 1.68, and the two 90% intervals barely overlap. Is this post-hoc selection legitimate? Clearly, this way of finding “interesting” results is biased in many standard settings. However, the Bayesian shrinkage methodology tends to offset such biases, as will be seen in the simulation results to follow.

## 6 Simulation Study of MBLR and RLR

The statistical properties of MBLR are studied using a simulation of the model that MBLR assumes. The purpose is to compare the accuracy of the MBLR results with that of the RLR results in the context of a situation like that in the example of Section 5, where there are rare events and sparse data. The simulation emulates that example in the sense that the distribution of subject covariates and treatment assignment matches the data in Section 5 exactly. Also, the list of response issues is the same and the baseline probabilities (as measured by the intercept term in the logistic regressions) of each response in the simulation are similar to that in the data of Section 5. The protocol for each simulation involves the following steps:

[13.]
1. Set the intercept term values , one for each of the responses.

2. Set the prior means .

3. Set the four prior standard deviations .

4. Repeat steps (5 through 12) times:

[12.]
1. Draw from , ; .

(Note: all random variable generation is performed using built-in functions. Also, constraints that must sum to 0 as varies over the categories of each single covariate are enforced by subtracting means over the corresponding covariate from the originally drawn . An analogous procedure is used in steps 7 and 8.)

2. Draw from , .

3. Draw from , .

4. Draw from , ; .

5. For each set of subjects having the same covariate values and treatment assignment, compute and using (1) and (2), ; .

6. Draw from binomial , ; .

7. Fit both the MBLR and the RLR model to the counts .

8. Update cumulative summaries of estimation results for each simulation as described below.

5. Create reports summarizing the estimation accuracy of the two methods regarding all parameters.

### Simulation Summary Statistics

There are parameters being estimated and two estimation methods: MBLR and RLR, so the total number of estimators being evaluated is . For simulation () and for estimate (), let:

1. value of parameter for simulation , as defined by steps 1, 2, 5, 6, 7 and 8,

2. value (posterior mean) of parameter for simulation ,

3. SE (posterior standard deviation) for parameter estimate for simulation ,

4. [average estimation error],

5. [square root of mean squared estimation error],

6. [average squared standardized estimation error],

7. [proportion of times 90% CI is too low],

8. [proportion of times 90% CI is too high].

These summary statistics focus on the estimation accuracy of and also on the calibration accuracy of . We want BIAS and RMSE to be as close to 0 as possible, we want to be near 1, and we want CI.05 and CI.95 to be near 0.05. The  estimates can be grouped by the two methods, the () responses (counting PRIOR_MEAN as a generalized response) and the different term definitions. The term definitions fall into three general term types:

 COV = {Ag,αgk}, TREAT = {B0,β0k}, TRT∗COV = {Bg,βgk}.

We can summarize the simulation of the estimates by averaging the six accuracy summaries listed above over groups defined by method, response and/or term type.

Finally, for the MBLR, we can summarize the posterior means and standard deviations of the four estimated prior standard deviations.

### Simulation Design

The simulations are designed to compare variations in three design factors, each at two levels, so that 8 separate simulations were performed, with each simulation having replications, and so that a simple comparison of the two levels of each factor will be based on 1000 replications at each level. The three design factors correspond to two different choices at each of the first three steps in the simulation protocol given above:

[ ]
1. Frequent responses only (most frequent 5 in the example data).

2. Both frequent and rare responses (all 10 issues in the example data).

The situation uses the same 10 issues as in the example of Section 5, with the values of the intercept terms set equal to the estimated values from the real data, shown in Display 7. The baseline probabilities are defined as , also shown, which range from 0.055 for Hyperkalaemia to 0.00044 for Anuria. When , the most frequent 5 response issues are used, as shown in rows 1–5 of Display 7.

3. Average of main ( for all , ).

4. Prior Means of main effects relatively large nonzero values.

For Level 2, the values of are set to two times the values estimated in the analysis of the actual data. Display 8 shows the coefficient values as estimated and as used in the simulations.

5. , , , (Small PSDs).

6. , , , (Large PSDs).

The Level 1 values of prior standard deviations are similar to those estimated from the example data, while the Level 2 values are significantly larger.

All simulations create 5752 subjects having the same joint distribution of covariates and treatment allocations as the actual data and as summarized in Display 2. Thus, and , when , while , when .

### Simulation Results

Display 9 shows summaries of the distributions of (square roots of) variance component estimates, which are denoted PSDs for prior standard deviations in the model equations (3)–(8). Since there are 4 separate PSDs in the model, and the simulations are run at two sets of PSDs, the scales of all the PSDs in Display 9 have been normalized by dividing each estimated PSD and each estimated sampling standard deviation by the true PSD used in the corresponding simulation. Thus, a value of 1 for an average estimated PSD in Display 9 is interpreted as an unbiased estimate, and a value of 0.1 for the standard deviation of the sampling distribution of a PSD in Display 9 is interpreted as a coefficient of variation of 10%.

Display 9 has 8 columns and 7 rows. There are 4 pairs of columns, corresponding to the sampling means and standard deviations of the estimates of each of the four PSDs in the model. The 7 rows of Display 9 correspond to different subsets of the 2000 simulations. Row 1 shows averages over all simulations, whereas the other rows show averages over a subset of 1000 simulations corresponding to the levels of each of the factors in the experimental design. For example, consider the columns labeled and in Display 9. In row 1, 1.005 implies that overall the mean of estimates of the Treatment PSD are within 0.5% of the true value, and the next value of implies that individual estimates are typically about 27% off the true value. Of course that value is principally reflective of the sample size and the experimental design of the clinical studies. All simulations used the same clinical study setups, but there was variation according to the three factors in the simulation. Going down the rows in these same two columns, we see that estimates of had almost exactly the same means and standard deviations whether all ten responses were being simulated or whether just the most frequent five responses were simulated. Similarly, the next two rows show that there was virtually no difference in the sampling means and standard deviations of between the situation where the average effects are about 0 versus relatively large effects. However, the final two rows of the Display show that when all four PSDs are small (, , , ) the estimate of is biased upward about 13%, and when all four PSDs are large (, , , ) it is biased downward by about the same percentage. The direction of the biases implies that estimates tend to be somewhat more central with respect to the restricted range imposed () than the true value, thus moderating the estimates. The coefficient of variation of is about 35% in the former case and about 19% in the latter case. This corresponds to roughly the same standard deviation of the estimate of whether is 0.6 or 1.2.

This effect only shows up with respect to ; the other columns in Display 9 show that mean estimates of , and are relatively unaffected by any of the three factors in the simulation, especially  and . Consideration of degrees of freedom may explain this—these two variance components have degrees of freedom, whereas has df and has df, so one might expect them to be harder to estimate (although the definition of degrees of freedom is somewhat fuzzy in this nonlinear Bayesian setting). Estimates of seem to be most variable percentagewise, with coefficient of variation in the 30–40 percent range. In all cases the coefficient of variation is larger for the smaller true PSDs. The standard deviation of estimation decreases when the true PSD decreases, but not fully proportionally.

However, remember that the goal of the analysis is not to estimate the variance components per se, but to use them to define a model that can better estimate the logistic coefficients by adjusting to global patterns in the data across responses and predictor categories. Each individual estimation does not assume that the PSDs are exactly equal to their posterior mean, but rather the estimation involves an integration across the posterior distribution of the PSDs. In that respect, it is interesting to examine the posterior standard deviations of the PSDs. They have not been included in Display 9 in order to save space, but in fact the average of the posterior standard deviations across simulations was remarkably similar to the sampling standard deviations of the posterior means of each PSD. They typically differed by only 10% or so for each of the 8 sets of 250 simulations. Thus, our model expects that the PSDs will be hard to estimate and works within that uncertainty.

### Estimation of Logistic Coefficients

Display 10 summarizes the simulation distributions of the various logistic regression coefficients. Part (a) of Display 10 focuses on the main effect of Treatment. The first two rows of Display 10(a) compare the Treatment effect accuracy of the RLR estimates to that of the MBLR estimates. The first four columns refer to the estimation of the prior mean coefficient, , what might be called the “all response summary,” while the last four columns refer to the estimation of coefficients, , for the individual responses. Across all 2000 simulations, the RMSE for RLR is almost double that of MBLR for estimation of , and more than double, on average, for estimating the . Since statistical efficiency is typically inversely proportional to the square of RMSE, this implies that MBLR is about 4 times as efficient as RLR at estimating treatment/comparator odds ratios in this setting.

The statistic is designed to measure the calibration of the posterior standard deviations computed by a method to the actual sampling distribution, where implies perfect calibration. When , the claimed standard errors of coefficients are too optimistic (too small), and the reverse is true when