# Bayesian Bivariate Subgroup Analysis for Risk-Benefit Evaluation

###### Abstract

Subgroup analysis is a frequently used tool for evaluating heterogeneity of treatment effect and heterogeneity in treatment harm across observed baseline patient characteristics. While treatment efficacy and adverse event measures are often reported separately for each subgroup, analyzing their within-subgroup joint distribution is critical for better informed patient decision-making. In this paper, we describe Bayesian models for performing a subgroup analysis to compare the joint occurrence of a primary endpoint and an adverse event between two treatment arms. Our approaches emphasize estimation of heterogeneity in this joint distribution across subgroups, and our approaches directly accommodate subgroups with small numbers of observed primary and adverse event combinations. In addition, we describe several ways in which our models may be used to generate interpretable summary measures of benefit-risk tradeoffs for each subgroup. The methods described here are illustrated throughout using a large cardiovascular trial () investigating the efficacy of an intervention for reducing systolic blood pressure to a lower-than-usual target.

Keywords: Heterogeneity of treatment effect; patient-centered outcomes research; personalized medicine; benefits and harms trade-off

## 1 Introduction

Both treatment effectiveness and treatment safety often vary according to patient sub-populations that are defined by observable baseline patient characteristics. Although the importance of evaluating the consistency of treatment effectiveness is often recognized when conducting subgroup analyses, heterogeneity in treatment safety is not examined as frequently. Moreover, when heterogeneity in treatment safety is addressed, such subgroup analyses are typically performed separately from the heterogeneity of treatment effect (HTE) analysis. While useful for addressing question of HTE and heterogeneity in adverse effects, such separate analyses ignore potentially important relationships between primary outcomes and safety outcomes. Dependencies between these two outcomes can substantially alter the risk-benefit considerations when compared with just analyzing their marginal distributions. In order to improve the relevance of subgroups analysis for assessing variation in patients’ risk-benefit profiles, it is critical to focus on examining differences in joint patient outcomes (Evans & Follmann (2016)) within key patient sub-populations.

Over the course of a clinical trial, patients experience a collection of outcomes of which some may be related to treatment benefit while others may be related to an adverse effect of treatment. From a patient-level perspective, a risk-benefit assessment involves examining the potential set of outcomes that would occur when taking a proposed treatment versus the potential set of outcomes that would occur under a control treatment. A treatment may be considered superior to the control if the likely set of outcomes under treatment is better than the likely set of outcomes under the control. In this article, we focus on the frequently occurring case where the recorded set of patient outcomes includes a time to some primary event and a binary indicator of whether or not a treatment-related adverse event occurred at some point during the period of patient follow-up. Similar ways of addressing both efficacy and safety have been proposed, for example, in the analysis of dose-finding studies (Padmanabhan et al. (2012)), but to our knowledge, little work has been done in the context of bivariate subgroup analysis.

Methods based on Bayesian hierarchical models have a number of advantages when conducting subgroup analyses (Jones et al. (2011) or Henderson et al. (2016)), and many of these advantages should be more strongly felt when extending such methods to bivariate patient outcomes. It is well-recognized that subgroups defined by multivariate patient characteristics often have many subgroups with small within-subgroup samples sizes. This leads to very noisy estimates of subgroup-specific quantities - a phenomenon which is likely to be further exacerbated when examining the joint distribution of patient outcomes. A key advantage of a Bayesian approach coupled with hierarchical modeling is that subgroup-specific parameter estimates are partially driven by data in that particular subgroup and partially driven by patient outcomes in all the other subgroups. Allowing such “borrowing of information” across subgroups leads to shrunken estimates of subgroup parameters which frequently improves both the accuracy and stability of estimation (Efron & Morris (1973)). In addition to simply generating more stable shrinkage estimates, Bayesian hierarchical models allow one to specify how subgroup parameters are related to one another. In this article, we focus on two approaches for modeling the relationships between the subgroup parameters (a saturated and an additive model), but many other alternatives could easily be incorporated into our framework for bivariate subgroup analysis.

In this article, our main goal is to harness the advantages of Bayesian modeling to describe the joint distribution of a primary and an adverse event across patient subgroups. To this end, we outline a within-subgroup parametric model which assumes an exponential distribution for both adverse event-free and adverse-event-occurring survival. While such a parametric assumption may not be strictly correct, our assumption is that survival follows an exponential distribution within each subgroup rather than the entire population represented by the trial. We argue that this is a sensible choice given the potentially small number of patients within any particular subgroup. Moreover, our approach for direct modeling of patients’ joint distributions provides great flexibility to analyze patients’ risk-benefit tradeoffs from a variety of perspectives. For example, we demonstrate how our model may be used to assess heterogeneity with respect to either a composite outcome which weights adverse event-free and adverse event-occurring survival differently, and how our model may be used to evaluate heterogeneity with respect to a treatment effect based on the probability of achieving an improved outcome.

This paper has the following organization. In Section 2, we begin by describing our motivating example - the SPRINT trial, and we describe both the subgroups to be analyzed and the primary and adverse events measured in this large clinical trial. In Section 3, we outline our parametric model for the joint distribution of the time-to-a-primary event and a binary adverse event, and we discuss its use in the context of bivariate subgroup analysis. Section 4 describes a general approach for specifying a prior distribution for the model parameters of Section 3, and we describe three particular ways of modeling the distribution of these parameters. In Section 5, we detail several interpretable measures of variation in patients’ risk-benefit profiles. Here, we describe how our Bayesian framework may be used to make inferences about such measures. Patient outcomes from the SPRINT trial are used to illustrate the use of these measures. Section 6 discusses the use of model diagnostics and model comparison, and Section 7 concludes with a brief discussion.

## 2 The SPRINT Trial

The Systolic Blood Pressure Intervention (SPRINT) trial (The SPRINT Research Group (2015)) was a large trial investigating the use of a lower-than-usual systolic blood pressure target among individuals deemed to be at increased cardiovascular risk. Specifically, the trial was designed to compare an intensive treatment (a systolic blood-pressure target of less than 120 mm Hg) versus the standard treatment (a systolic blood-pressure target of less than 140 mm Hg). In total, individuals were enrolled in the trial, and each trial participant was classified as being at increased cardiovascular risk but without diabetes and had a systolic blood pressure greater than mm Hg. Of the trial participants, were randomized to the intensive treatment arm while were randomized to the standard treatment arm.

In the SPRINT trial, patient outcomes were recorded as a composite outcome where a primary event (PE) was said to occur if any one of the following five events took place: death from cardiovascular causes, stroke, myocardial infarction, acute coronary syndrome not resulting in myocardial infarction, or acute decompensated heart failure. At the conclusion of the trial (median follow up time of years), a total of PEs had occurred with PEs occurring in the intensive treatment arm and PEs occurring in the standard treatment arm. Based on these differences in the distribution of the PEs across treatment arms, the intervention was determined to have an overall benefit (an estimated log-hazard of , confidence interval: ). In addition to the primary outcome, participants were also monitored for the occurrence of serious adverse events (SAEs). The most common SAEs included acute kidney injury or acute renal failure, injurious fall, electrolyte abnormality, syncope, and hypotension. Each of the observed SAEs were classified according to whether or not they were possibly or definitely related to the intervention. Of the observed treatment-related SAEs, individuals in the intensive treatment group experienced a treatment-related SAE while only participants in the standard treatment group experienced a treatment-related SAE.

While the marginal occurrence of SAEs in each treatment arm suggests that the standard treatment carries a lower risk of SAEs, the joint occurrence of primary events and SAEs may be more informative from a patient-centered perspective. That is, the risk-benefit tradeoffs involved in choosing a treatment strategy are more apparent when the likelihood of each PE-SAE combination can be examined. This can be particularly important in cases where the dependence pattern between the PE and SAE differs substantially across treatment arms.

Standard Treatment | Intensive Treatment | |||
---|---|---|---|---|

SAE | No SAE | SAE | No SAE | |

PE | 18 | 301 | 30 | 213 |

No PE | 100 | 4264 | 190 | 4245 |

Table 1 shows the number of joint occurrences of PEs and treatment related SAEs in the SPRINT trial. As evidenced by the counts in Table 1, most patients in both treatment arms remained free from both PEs and SAEs over the course of the trial. A notable difference between the two treatment arms is the difference in the proportion of PEs that were accompanied by an SAE. However, only examining the joint counts of PEs and SAEs, of course ignores the timings of the PEs. In the next section, we outline a subgroup-specific parametric model for describing the joint distribution of a time-to-event primary outcome and a binary safety outcome.

## 3 Bivariate Subgroup Analysis

### 3.1 Data Structure and Summary Statistics

We assume that patients have been enrolled in a randomized clinical study where patients are monitored to determine whether they experience either a primary event (PE) or an adverse event (AE) (or both). For the individual in the study, denotes the time-to-failure for the PE of interest. We observe and an event indicator where denotes time-to-censoring and denotes the indicator function. In addition to observing the pair , we observe an indicator of whether or not patient experienced at least one AE during the study. That is, if patient experienced at least one AE and otherwise. We let denote that patient was assigned to the treatment arm, and we let denote that patient was assigned to the control arm. We assume that distinct patient subgroups have been defined according to multivariate patient characteristics and that is a variable which indicates membership in one of the subgroups. For example, suppose that patient subgroups have been formed on the basis of age (young/old) and sex (male/female) combinations. In this case, we would have subgroups to represent each age/sex combination, and would be a variable indicating to which of the four categories the individual belongs.

Subgroup analyses that only examine the primary outcome typically utilize a collection of summary statistics which measure treatment effectiveness in each subgroup considered. Such summary statistics are usually computed for marginally defined subgroups (i.e., subgroups defined by looking at one variable at-a-time) or subgroups which are defined according multivariate characteristics. In our analysis of bivariate patient outcomes, we require that one compute a pair of summary statistics for each AE-treatment arm combination within each subgroup. The summary statistic is the number of PEs that are observed to occur among the patients in subgroup , treatment arm , and have AE status . Similarly, is the total follow-up time for the group of patients that are in subgroup , treatment arm , and have AE status . In addition to , we require that one compute the summary statistics for each treatment arm-subgroup combination. The summary statistic is the total number of AEs observed for those patients in subgroup and treatment arm . More formally, and are defined in terms of the individual-level responses as

Table 2 shows the summary statistics from the SPRINT trial using subgroups. These subgroups were formed by looking at all possible combinations of the following three baseline patient variables: chronic kidney disease (Yes/No), age (), and sex (male/female).

Subgroup | Standard Treatment | Intensive Treatment | |||||||
---|---|---|---|---|---|---|---|---|---|

CKD | Sex | Age | |||||||

No | Male | (3, 92.00) | (96, 5546.11) | 29 | (10, 174.05) | (54, 5446.53) | 61 | 3528 | |

Yes | Male | (3, 48.41) | (28, 1364.73) | 16 | (5, 103.15) | (25, 1227.70) | 32 | 858 | |

No | Male | (1, 19.57) | (46, 1277.47) | 8 | (1, 57.67) | (26, 1265.94) | 19 | 913 | |

Yes | Male | (6, 40.00) | (47, 977.71) | 15 | (7, 88.31) | (38, 974.85) | 31 | 730 | |

No | Female | (0, 40.08) | (31, 2641.97) | 13 | (0, 67.47) | (25, 2705.38) | 20 | 1706 | |

Yes | Female | (2, 42.41) | (12, 948.66) | 14 | (4, 50.23) | (16, 1000.35) | 16 | 617 | |

No | Female | (0, 37.84) | (16, 835.07) | 12 | (1, 60.98) | (18, 778.16) | 20 | 568 | |

Yes | Female | (3, 30.81) | (25, 612.09) | 11 | (2, 66.44) | (11, 634.87) | 21 | 441 |

### 3.2 Distribution of Key Summary Statistics and Joint Distribution of Primary and Adverse Events

In our analysis, we assume the number of observed events follows a Poisson distribution whose mean depends on the total follow-up time and the hazard rate within the subset of patients with the combination of treatment arm , AE status , and subgroup . Specifically, given , we assume is distributed as

The assumed Poisson distribution for the summary statistics is equivalent to assuming that, at the individual level, the time-to-failure of the primary event follows an exponential distribution with rate . More specifically, the likelihood function that results from assuming is the same (ignoring constant terms) as the likelihood function that results from assuming that

To fully define the joint distribution of the time-to-primary event and the AE indicator given treatment arm and subgroup information, we now only need to specify the conditional distribution of given and . Here, we assume that is a Bernoulli random variable with a success probability that depends on the treatment arm and subgroup. This implies the distribution of is given by

where is the number of individuals in treatment arm and subgroup .

The parameters characterize the joint distribution of the primary and adverse event within the subgroup and treatment arm . To see why this is the case, note that the joint probability may be expressed as

Comparing and for each subgroup enables one to compare the effect of the treatment on altering the risk-benefit profiles of patients in subgroup .

## 4 Modeling Subgroup Effects

### 4.1 Regression Models for Subgroup Parameters

As described above, our model for the summary measures depends on the collection of hazard rate parameters and the AE probabilities . Because the number of parameters is potentially quite large, models which induce shrinkage are needed to produce stable estimates of these quantities. We describe a general regression models for and which allow for shrinkage and allow for one to include information regarding how the subgroups parameters are related. In Sections 4.1.1-4.1.3, we describe three specific regression models, and in Section 4.2 we describe our prior specification for the parameters of two of these regression models.

We consider a regression setting with design matrices and for the hazard rate parameters and AE probabilities respectively. The matrices and are and respectively, and and denote the rows of and respectively. The hazard rate parameters and the adverse-event probabilities are related to and via

where . Here is the vector , and is the vector . In total, this regression model involves parameters. We describe a few possible ways of choosing and below.

#### 4.1.1 Example 1: Saturated Model.

In this model, the number of regression parameters is equal to the number of summary statistics. The design matrices and are assumed to be equal to the identity matrix, and each , are assumed to be vectors. This means that, for any combination of , we have

In this model, there is no regression structure linking the parameters or the parameters together in such a way that more strength is borrowed between subgroups which share a greater number of patient characteristics. Rather, the and are treated separately with no additional information used to indicate the relationships among the subgroups.

#### 4.1.2 Example 2: Additive Model.

In this model, the values of and are determined additively from the variables comprising subgroup . This approach is analogous to the Dixon and Simon model for subgroup analysis (see e.g., Dixon & Simon (1991), Jones et al. (2011), or Henderson et al. (2016)). The number of regression parameters in this model are determined by both the number of patient variables and the number of levels within each one of these variables.

To define the additive model, we suppose that the subgroups are formed by patient variables, and the patient variable has levels . If we consider modeling for subgroup , it is necessary to know the levels of each of the patient variables comprising subgroup . If the level of the variable in subgroup is denoted by , then

(1) |

where and for . As an illustration of (1), consider an example where four subgroups are formed from the combinations of the patient variables of age (young/old) and sex (male/female) and that male and young denote the first levels of the age and sex variables respectively. Suppose further that we label the four subgroups in the following manner: subgroup 1 denotes male/young (i.e., , ), subgroup 2 denotes male/old (i.e., , ), subgroup 3 denotes female/young (i.e., , ), and subgroup denotes female/old (i.e., , ). In this case, the would be expressed as

and the relation between the and would be expressed similarly.

#### 4.1.3 A Proportional Hazards Model

An even more parsimonious approach than the regression models discussed above is to assume instead that the hazard ratios do not vary across subgroups. Under this assumption, the parameters would be modeled as

and the hazard rates would be obtained from the AE-free hazard rates via where is the constant hazard ratio. With this proportional hazards assumption, there are now regression parameters rather than the parameters required for the regression models discussed in Sections 4.1.1 and 4.1.2. While this approach makes a quite restrictive assumption about how the AE-present and AE-free hazard rates are related, this could be a useful approach if the proportional hazards assumption is justifiable and is seen to be reasonable in light of model diagnostics.

### 4.2 Prior Specification

#### 4.2.1 Prior for Saturated Model.

For each value of , we assume the parameters are drawn from a common normal distribution with mean and variance

The distribution of all the parameters depends only on the following four vectors: , , , and . The mean vector for treatment arm is assumed to have the following bivariate normal distribution

where means that has a bivariate normal distribution with mean vector and covariance matrix . Similarly, the distribution for on the log scale is assumed to be given by

For the correlation parameters and , we assume that and for each . Our prior distribution for the is similar to the prior for the . Specifically, we start by assuming that

We then assume that for , and we assume that for .

In total, there are hyperparameters in this model. These are: , , , , , , , and . Our default choice is to set and to set . Our justification for these default settings is discussed in more detail below.

Our prior specification implies the are drawn from a common normal distribution with mean and variance . Because it is difficult to specify a range of reasonable values for the median of , we assign a vague prior distribution by setting , for . A similar justification can be made for setting . Turning now to the variation in we note that, for any pairs of subgroup indices , the distribution of the hazard ratio (given ) follows a log-normal distribution with parameters and . Our prior distribution for is based on the observation that the hazard ratio is most likely between and . Because , this means that this approximate probability is greater than whenever and greater than whenever . We want to choose a prior for such that is fairly large for the most probable values of . This can be done by assuming that follows a log-normal distribution with parameters and (i.e., setting . This means that has a median of and that the probability that is less than is approximately . A similar justification can be made for setting .

#### 4.2.2 Prior for Additive Model.

We assign vague priors to the intercept terms and . Specifically, it is assumed that

with and set to default values of and .

The hierarchical model for the remaining regression coefficients and is the same as that used for the regression coefficients in the saturated model. That is, we assume that

where . The prior distributions and default hyperparameters for , and the four vectors , , , and are exactly the same as described for the saturated model.

#### 4.2.3 Posterior Computation

The Bayesian bivariate subgroup models proposed here can be easily computed using STAN software. With STAN, we have implemented both the saturated and additive models. In our analysis of the SPRINT trial, we used posterior draws obtained from four chains each having iterations with the first iterations treated as burn-in. The code for implementing our models can be obtained from our website: http://hteguru.com/index.php/bbsga/

## 5 Measures of Risk-Benefit

### 5.1 Heterogeneity in Joint Binary Outcomes

A natural bivariate outcome of interest is survival past a specified time point coupled with an indicator of whether or not an AE occurred during the follow-up period. For this bivariate outcome, there are four possible results: survival past /no AE occurs, survival past /AE occurs, failure before /no AE occurs, and failure before /AE occurs. The subgroup-specific posterior probabilities of each of these four outcomes may be obtained from the models detailed in Sections 3 and 4. The differences in these probabilities between treatment arms provide a measure by which to compare how treatment influences the probability of experiencing each of these four outcomes. If we define the function as

the differences in these four joint probabilities may be expressed as

If the PE is regarded as more undesirable than the adverse event, then the parameters may be thought of as being ordered according to the desirability of the associated outcome. Specifically, represents the difference in the probability of achieving the most desirable outcome, namely remaining both PE and AE free until time . Similarly, represents the difference in the probability of achieving the second most desirable outcome, namely remaining free from the PE until time while experiencing an AE at some time before . The parameter represents the difference in probability of experiencing the PE while not experiencing the AE, and represents the difference in probability of experiencing both the primary and adverse event before time . It is worth noting that for each . That is, if the treatment increases the probability of one outcome relative to control, it must decrease the probability of another outcome (or group of outcomes) by a corresponding amount.

Using the SPRINT trial data with the subgroups defined by chronic kidney disease status, age, and sex, Figure 1 plots the posterior means of . The posterior quantities shown in Figure 1 were computed using the saturated model. For each of the forest plots in Figure 1, the solid vertical line represents the average value of the posterior means of across subgroups, i.e., where is the posterior mean of . The quantity can be thought of as an estimate of the overall treatment effect for PE-AE category . Looking at Figure 1, the estimated overall treatment effect is negative for PE-AE categories 1 and 3 and is positive for categories 2 and 4 though the magnitude of the estimates for categories 3 and 4 is very small. Specifically, the values of were , , , and . These values suggest the intensive treatment increases the probability of 3-year PE-free survival while experiencing a treatment-related SAE, and it reduces the probability of experiencing a PE without an SAE. However, the intensive treatment also appears to reduce the probability of remaining both PE and SAE free for 3 years by an a similar amount to the increase in the probability of outcome . Comparing the estimates , , and leads to a somewhat ambiguous picture about the effect of the intensive treatment on improving patient outcomes. A more complete evaluation of the risk/benefit effect of the intensive treatment would involve consideration of patients’ “baseline bivariate risk”. Specifically, the baseline bivariate risk refers to the probability of each PE-AE combination occurring under the control arm. Such baseline information would be useful in assessing the meaningfulness of any change that the treatment induces on the probability of outcome .

Turning now to the variation in across subgroups, two subgroups from Figure 1 seem to stand out. The more prominent of these is the no-CKD/Age /Female subgroup. In this subgroup, the credible interval for does not cover the overall treatment effect estimate , and the credible interval for nearly does not cover . This suggests the effect of the intensive treatment on altering a patients’ PE-AE outcome profile is somewhat different in the no-CKD/Age /Female subgroup than in the broader population represented by the trial. Specifically, compared to other groups in the trial, the intensive treatment for those in the no-CKD/Age /Female subgroup had almost no effect on any of the PE-AE probabilities as the estimates for this subgroup were much closer to zero when compared with their overall counterparts . Another subgroup worth highlighting is the no-CKD/Age /Male subgroup. Though not as extreme as the no-CKD/Age /Female subgroup, the estimates of and are modestly closer to zero than the average values of these parameters, and while the credible intervals for these parameters still covered and , most of the mass of these credible intervals were to the right and left of and respectively.

Figure 2 displays posterior estimates of from the additive model. One notable difference between the additive and the saturated model is in the overall estimates and . Specifically, the estimate was much closer to zero in the additive model than in the saturated model. As with the saturated model, both the no-CKD/Age /Female and no-CKD/Age /Male subgroups stand out in the additive model due to lower magnitudes of . The estimated values of for the no-CKD/Age /Female subgroup is actually quite similar to those from the saturated model. However, due to differences between the saturated and additive models in the overall estimate , the value of for this subgroup does not appear different than the estimated overall value of this parameter.

### 5.2 Comparing Utilities across Subgroups

While forest plots of can be useful for examining variability in the effect of treatment on altering patients’ risk-benefit profiles, interpreting such graphs can be somewhat challenging. Converting the bivariate outcomes into a composite score by weighting each outcome allows one to report heterogeneity with respect to a univariate score. Weighting methods for assessing risk-benefit have been described by a number of researchers including, for example, Chuang-Stein (1994). We consider here two approaches for weighting joint patient outcomes.

A direct way of combining to produce a univariate score would be to compute

(2) |

for weights . Assuming that a patient receives a utility of when experiencing outcome , represents the difference in expected utility between treatments for those in subgroup . In other words, if one defines a composite outcome where a patient is assigned a composite of score of according to , then represents the subgroup-specific expected difference in this composite outcome.

Rather than weight joint binary outcomes as is done in computing , an alternative is to incorporate the survival time directly into the composite outcome. We define such a composite score which, for the patient, we define as

(3) |

where represents a truncation point that represents a follow-up period of interest. The composite measure may be viewed as an outcome whose expectation is a type of weighted restricted mean survival time (RMST) (see e.g., Royston & Parmar (2013)). We use rather than to reduce the impact of model extrapolation beyond the follow-up period of interest . With the measure , a patient receives a weight of for each time unit of PE-free survival assuming he/she did not experience an AE over the time interval . Likewise, a patient receives a weight of for each unit of PE-free survival time assuming he/she did experienced an adverse event at some point in the time interval . The composite outcome bears some resemblance to the Q-TWist measure described in Gelber et al. (1989) though does not explicitly distinguish between time with toxicity and time without toxicity.

The expectation of the composite outcome conditional on subgroup and treatment arm information is given by

Subgroup-specific treatment effects relative to this measure may then be defined as the difference in the expectation of in the treatment arm vs. the expectation of in the control arm

(4) |

Figure 3 shows estimates of the treatment effects using the saturated model and two choices of weights . For each choice of weights, we set , and we set to and for the left-hand and right-hand plots respectively. When , may be interpreted as the proportion of value that one receives from a unit of extra life when an AE is known to occur as compared to the value of an extra unit of life when no AE occurs. Hence, a value of very close to one implies that AE-free survival and AE-occurring survival are valued similarly while values of much lower than one imply that AE-free survival is valued much more highly than AE-occurring survival.

It is interesting to note that the younger subgroups (i.e., age ) tend to consistently receive less benefit treatment for both choices of weights. Indeed, all of the younger subgroups have posterior means less than the average value of the posterior means . In Figure 3, two subgroups stand out in terms of being different than the average value of across subgroups. These are the no-CKD/Age /Female and yes-CKD/Age /Female subgroups. Both of these subgroups do not appear to benefit when using either choice of weights to compute . This is not that surprising for the no-CKD/Age /Female subgroup as the four-outcome subgroup analysis from Figure 1 suggested the intensive treatment had very little impact in changing the probabilities of any of the four outcomes. Similarly, the point estimates from Figure 1 also suggested that the treatment had little effect on the yes-CKD/Age /Female subgroup though the posterior uncertainty for these estimates was much greater than for the no-CKD/Age /Female subgroup. Interestingly, the no-CKD/Age /Male subgroup did not show clear evidence of having a worse treatment effect than the average from the other subgroups.

### 5.3 Heterogeneity in the Probability of Outcome Improvement

While using weights to create univariate composite measures can be useful, it is often difficult to justify a particular choice of weights in an analysis. An alternative to weighting is to order the outcomes according to their desirability and to use the probability of outcome improvement as the treatment effect to be measured. An advantage of approaches which rely on ordering outcomes (see e.g., Claggett et al. (2015) or Follmann (2002)) is that they do not require that one assign numerical values to each outcome but only require that one be able to order outcomes according to their desirability.

We consider here an ordering-based definition of treatment effect that compares the outcomes of two patients randomly drawn from each treatment arm. Specifically, we consider two treatment-discordant patients and (say and ) which are chosen randomly from subgroup . If we could observe the failure times and along with and , it would be possible to determine which patient had the superior outcome. That is, we could construct a function such that when outcome is superior to and otherwise. While there are many ways of choosing , Table 3 describes our approach for ordering pairs of bivariate outcomes. Here, we rely on an indifference parameter which represents the additional gain in survival that would be needed to compensate for suffering from the AE. For example, suppose that and and that the indifference parameter is set to . In this case, patient would need an at least longer survival time than patient in order for the outcome of patient to be preferable to the outcome of patient .

Outcomes | Preferred Treatment | |
---|---|---|

Using the outcome orderings described in Table 3, we define the subgroup-specific treatment effects as

(5) |

The parameter represents the probability that patient has a superior outcome to patient minus the probability that patient has a superior outcome to patient when assuming both patients and belong to subgroup . Hence, when patients in subgroup are sure to benefit from treatment, and when patients in subgroup are certain to be harmed by treatment. Note that is a function of the subgroup-specific parameters and and thus one may easily obtain draws from the posterior distribution of by transforming samples from the posterior distribution of . More specifically, is related to the model parameters via

Figure 4 plots the posterior means of for each subgroup using both the saturated and additive models. For this analysis, we set to . This means that an individual receiving the intensive treatment would need at least times as much PE-free survival as under the standard treatment in order to be “compensated” for the fact that he/she would experience a treatment-related SAE under the intensive treatment while remaining SAE-free under the standard treatment arm. In Figure 4, the “overall” treatment effects are represented by the solid vertical lines. These are computed as where is the posterior mean of . The overall treatment effect estimates were and for the saturated and additive models respectively.

As shown in Figure 4, there is less clear variability across subgroup when using the treatment effect . Specifically, all the credible intervals for both the saturated and additive models cover the average value of the posterior means . It is interesting to note that, on this scale, the estimated treatment effect for the no-CKD/Age /Female subgroup is now very close to the estimated overall treatment effect which seems to stand in contrast to the results in Sections 5.1 and 5.3. While the width of the credible intervals in both the saturated and additive models prevents us from making any strong conclusions about this subgroup, it is worthwhile to note how the prominence of a treatment-covariate interaction can change when the treatment effect scale is changed. The differences between the saturated and additive models for the no-CKD/Age /Female and yes-CKD/Age /Female subgroups is also noteworthy. In addition to much greater shrinkage with the additive model, the difference between the estimates of in these subgroups changes sign between the saturated and additive models. This is due to the more structured borrowing of information among subgroups that is enabled by the additive model. Specifically, much information is shared among subgroups carrying the same CKD status with the no-CKD subgroups being consistently shrunken towards a larger value than the yes-CKD subgroups.

## 6 Model Checking and Diagnostics

Our approach assumes that, conditional on an AE indicator, survival follows an exponential distribution within each subgroup/treatment arm combination. We feel that such a simple parametric approach is needed due to the potentially large number of “cells” that arise from the number of subgroup/treatment arm/AE combinations where the sparse amount of within-cell data precludes more ambitious modeling. Despite this, our model treats patient survival as arising from a type of mixture-of-exponentials distribution and such a mixture can often provide a good fit to the observed overall survival patterns.

Posterior predictive checks (Meng (1994)) are a useful tool for assessing the goodness-of-fit of a Bayesian model. Such checks are performed by first simulating outcomes from the posterior predictive distribution and computing a test statistic (or a collection of test statistics) of interest for each simulation replication. The distribution of the computed test statistics across posterior predictive draws is then compared with the observed value of this test statistic. An observed value of the test statistic that seems “typical” with respect to the posterior predictive distribution of this test statistic is an indication that the model provides a good fit or, at least, is not clearly deficient in some way. Figure 5 shows treatment-specific estimated survival curves from draws from the posterior predictive distribution. Here, the test statistics of interest are the treatment-specific Kaplan-Meier estimates. As shown in Figure 5, the Kaplan-Meier estimates for the observed data seem typical when compared with the collection of Kaplan-Meier estimates from the posterior predictive draws. Indeed, for both treatment arms, the observed Kaplan-Meier estimates are roughly centered among the posterior predictive survival curves, and the shapes of the posterior predictive survival curves largely resemble the linear shape of the observed Kaplan-Meier estimates.

Beyond graphical displays of survival, one can use univariate test statistics and posterior predictive p-values as a way of more formally assessing goodness-of-fit. Posterior predictive p-values represent the probability that hypothetical replications of the test statistic from the posterior predictive distribution is more extreme than the observed value of the test statistic. It is often recommended to try test statistics which are not directly modeled by the probability model used (Gelman et al. (2014)). Figure 6 displays the posterior predictive distributions of the restricted mean survival time (RMST) using both the saturated and additive models. For the saturated model, the two-sided posterior predictive p-values are and for the standard and intensive treatment arms respectively. For the additive model, the two-sided posterior predictive p-values are and for the standard and intensive treatment arms respectively. These relatively large posterior predictive p-values do not indicate that either model has a serious flaw.

Predictive Criterion | Additive Model | Saturated Model |
---|---|---|

DIC | ||

WAIC |

Information criteria can be used to compare models such as the saturated and additive models that have differing number of parameters. The use of information criteria can be helpful whenever model checking procedures do not clearly show that one of the models is inadequate or when there is no strong a priori reason for favoring one model over the other. Two well-known information criteria that can be computed for Bayesian hierarchical models are the deviance information criterion (DIC) (Spiegelhalter et al. (2002)) and the widely-applicable information criterion (WAIC) (Watanabe (2010)). Both of these criterion are found by adding a penalty to negative two times the log-likelihood function evaluated at the posterior mean of the model parameters. The penalty is determined by the model degrees of freedom which are computed differently by DIC and WAIC. Lower values of both the DIC and WAIC imply better model performance.

As shown in Table 4, the additive model had a DIC which was roughly less than the DIC of the saturated model. Differences in the DIC of less that are often not considered meaningful (see e.g., Carlin & Louis (2009)) so this difference of provides moderate support in favor of the additive model. Moreover, the additive model may be especially preferred if a parsimonious model is desired. In contrast to the DIC, the saturated and additive models are much closer when using WAIC as the comparison measure. Nevertheless, a difference of roughly in WAIC value still provides mild support in favor of the additive model. Thus, for the SPRINT data, there does not seem to be a good justification for using the saturated model as the log-likelihood (evaluated at the posterior mean of the parameters) for both models is roughly the same while the saturated model requires considerably fewer model parameters.

## 7 Discussion

In this article, we have described a Bayesian approach for performing subgroup analysis with respect to both a primary and safety endpoint and have detailed its use in generating relevant within-subgroup risk-benefit measures. A key feature of our approach is that the benefit-safety endpoints are modeled jointly rather than used to perform separate subgroup analyses of benefit and safety.

Summary assessment of risks and benefits, i.e. combining the multiple risks and benefit outcomes into a single utility measure, is a challenging problem. In section 5.2 we have provided a few reasonable options, Eqs. 2, 4, 5. Other summary measures are certainly possible. It should be remarked that both the primary and safety endpoints used in the SPRINT trial are themselves composite endpoints. Hence, it may be worth decomposing these composite measures further for a more thorough assessment of benefit and risk (e.g., death and other cardiac events are treated the same).

## References

- (1)
- Carlin & Louis (2009) Carlin, B. P. & Louis, T. A. (2009), Bayesian Methods for Data Analysis, Chapman & Hall/CRC, Boca Raton, FL.
- Chuang-Stein (1994) Chuang-Stein, C. (1994), ‘A new proposal for benefit-less-risk analysis in clinical trials’, Control Clinical Trials 15, 30–43.
- Claggett et al. (2015) Claggett, B., Tian, L., Castagno, D. & Wei, L.-J. (2015), ‘Treatment selections using riskâbenefit profiles based on data from comparative randomized clinical trials with multiple endpoints’, Biostatistics 16(1), 60–72.
- Dixon & Simon (1991) Dixon, D. O. & Simon, R. (1991), ‘Bayesian subset analysis’, Biometrics 47, 871–881.
- Efron & Morris (1973) Efron, B. & Morris, C. (1973), ‘Stein’s estimation rule and its competitors - an empirical Bayes approach’, Journal of the American Statistical Association 68(3), 117–130.
- Evans & Follmann (2016) Evans, S. R. & Follmann, D. (2016), ‘Using outcomes to analyze patients rather than patients to analyze outcomes: A step toward pragmatism in benefit:risk evaluation’, Statistics in Biopharmaceutical Research 8(4), 386–393.
- Follmann (2002) Follmann, D. (2002), ‘Regression analysis based on pairwise ordering of patients’ clinical histories’, Statistics in Medicine 21(22), 3353–3367.
- Gelber et al. (1989) Gelber, R. D., Gelman, R. S. & Goldhirsch, A. (1989), ‘A quality-of-life oriented endpoint for comparing treatments’, Biometrics 45(3), 781–795.
- Gelman et al. (2014) Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A. & Rubin, D. (2014), Bayesian Data Analysis, Third Edition (Chapman & Hall/CRC Texts in Statistical Science), third edn, Chapman and Hall/CRC, London.
- Henderson et al. (2016) Henderson, N. C., Louis, T. A., Wang, C. & Varadhan, R. (2016), ‘Bayesian analysis of heterogeneous treatment effects for patient-centered outcomes research’, Health Services and Outcomes Research Methodology 16, 213–233.
- Jones et al. (2011) Jones, H. E., Ohlssen, D. I., Neuenschwander, B., Racine, A. & Branson, M. (2011), ‘Bayesian models for subgroup analysis in clinical trials’, Clinical Trials 8, 129–143.
- Meng (1994) Meng, X. L. (1994), ‘Posterior predictive p-values’, Annals of Statistics 23(3), 1142–1160.
- Padmanabhan et al. (2012) Padmanabhan, S. K., Berry, S., Dragalin, V. & Krams, M. (2012), ‘A Bayesian dose-finding design adapting to efficacy and tolerability response’, Journal of Biopharmaceutical Statistics 22(2), 276–293.
- Royston & Parmar (2013) Royston, P. & Parmar, M. K. (2013), ‘Restricted mean survival time: an alternative to the hazard ratio for the design and analysis of randomized trials with a time-to-event outcome’, BMC Medical Research Methodology 13(1), 152.
- Spiegelhalter et al. (2002) Spiegelhalter, D. J., Best, N. G., Carlin, B. P. & ‘van der Linde’, A. (2002), ‘Bayesian measures of model complexity and fit (with discussion)’, Journal of the Royal Statistical Society B 64(4), 583–639.
- The SPRINT Research Group (2015) The SPRINT Research Group (2015), ‘A randomized trial of intensive versus standard blood-pressure control’, The New England Journal of Medicine 373(22), 2103–2116.
- Watanabe (2010) Watanabe, S. (2010), ‘Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory’, Journal of Machine Learning Research 11, 3571–3594.