Simultaneous Confidence Intervals for Ranks With Application to Ranking Institutions

Simultaneous Confidence Intervals for Ranks With Application to Ranking Institutions

Diaa Al Mohamad1       Jelle J. Goeman         Erik W. van Zwet
Leiden University Medical Center, The Netherlands

Last update
11Diaa Al Mohamad is a postdoc researcher at Leiden University Medical Center (LUMC), Einthovenweg 20, 2333 ZC Leiden, The Netherlands (email: Jelle J. Goeman is a professor in Biostatistics at the LUMC (email: Erik W. van Zwet is an associate professor in medical statistics at the LUMC (email: This research is funded by the NWO VINI grant 639.072.412.

When a ranking of institutions such as medical centers or universities is based on an indicator provided with a standard error, confidence intervals should be calculated to assess the quality of these ranks. We consider the problem of constructing simultaneous confidence intervals (CIs) for the ranks of centers based on an observed sample. We present a novel method based on Tukey’s honest significant difference test (HSD) which is the first method to produce valid simultaneous CIs for ranks. Moreover, we introduce a new variant of Tukey’s HSD based on the sequential rejection principle. The new algorithm ensures familywise error control, and produces simultaneous confidence intervals for the ranks uniformly shorter than those provided by Tukey’s HSD for the same level of significance. We illustrate the method through both simulations and real data analysis from 64 hospitals in the Netherlands. Software for our new methods is available online in package ICRanks downloadable from CRAN. Supplementary materials include supplementary R code for the simulations and proofs of the propositions presented in this paper.
Keywords: Tukey’s HSD, the sequential rejection principle, rankability, bootstrap, hospitals ranking.

1 Introduction

Estimation of ranks is an important statistical problem which appears in many applications in healthcare, education and social services (Goldstein and Spiegelhalter, 1996) to compare the performance of medical centers, universities or more generally institutions. Estimates of ranks have generally a great uncertainty so that confidence intervals (CIs) become crucial (Marshall and Spiegelhalter, 1998; Goldstein and Spiegelhalter, 1996). It is surprising that inference of ranks has received little attention in the statistical literature. In applications, ranks are rarely accompanied with CIs and if so these are generally pointwise. This paper presents a first method to produce simultaneous CIs at a prespecified joint level for the ranks with correct coverage of the true ranks. Simultaneity is important in the context of ranking estimation whenever we are not interested in a specific named institution but rather in all the institutions together. Simultaneity is also necessary to quantify the uncertainty about which institutions are ranked best, second best, etc.

To the best of our knowledge, there is not yet any valid method which produces simultaneous confidence intervals for ranks. A method which claims to produce simultaneous CIs for ranks is based on bootstrapping and was introduced by Zhang et al. (2014). Pointwise CIs for ranks using bootstrap were introduced by Goldstein and Spiegelhalter (1996) and the method was adopted later in several papers such as Marshall and Spiegelhalter (1998), Gerzoff and Williamson (2001) and Feudtner et al. (2011) among others. However, it was pointed out by Hall and Miller (2009) and Xie et al. (2009) that the bootstrap pointwise CIs fail to cover the true ranks in the presence of ties or near ties among the compared institutions. Zhang et al. (2014) used these pointwise bootstrap CIs to produce simultaneous CIs for the ranks and their method was adopted later in some recent papers such as Waldrop et al. (2017) and Moss et al. (2017). We argue that the CIs for the ranks produced by the method of Zhang et al. (2014), like the method of Marshall and Spiegelhalter (1998), might cover simultaneously at level only when the differences among the compared institutions are very large, which in most practical situations does not hold, see Section 3 for more details.
Other methods in the literature include funnel plots, see Tekkis et al. (2003), Spiegelhalter (2005) among others. Methods based on empirical Bayes approaches were also considered, see Laird and Louis (1989), Houwelingen et al. (1999), Lin et al. (2006), Lin et al. (2009), Lingsma et al. (2009) and Noma et al. (2010) among others. These two approaches although have been considered in comparing institutions, they do not aim to build CIs for ranks.
Testing pairwise differences between means was also used to produce pointwise CIs for ranks (Lemmers et al., 2007, 2009; Holm, 2012; Bie, 2013). Lemmers et al. (2007) tested pairwise differences among Dutch hospitals by calculating Z-scores for their performance indicators, but they did not correct for multiple testing and thus their CIs for ranks are not simultaneous. Holm (2012) (see also Bie (2013)) calculated also a Z-score, but he applied Holm’s sequential algorithm to correct for multiple comparisons on the institution level, that is for each institution he corrects for comparisons with other instituions. Nevertheless, this is only sufficient if we are interested in one of the institutions, but it is not sufficient to produce simultaneous confidence intervals for the ranks of the institutions.

Our novel method uses Tukey’s honest significant difference (HSD) test (Tukey, 1953) which controls the familywise error rate (FWER). We show that Tukey’s HSD can be used to produce (valid) simultaneous confidence intervals for ranks. We also introduce a new iterative procedure based on Tukey’s HSD and the sequential rejection principle (Goeman and Solari, 2010). The new algorithm produces simultaneous confidence intervals for the ranks uniformly shorter than those obtained using Tukey’s procedure for the same joint confidence level.

We also introduce in this paper a new rankability measure defined as the proportion of pairs of institutions that have different true performances. We estimate the true rankability by our methods (Tukey’s HSD or its variant), and provide a lower confidence bound for it.

The paper is organized as follows. In Section 2, we explain the context of this paper, the notations and the objective. In Section 3, we review the bootstrap method to produce CIs for ranks. In Section 4, we revisit Tukey’s HSD and show that it can be used to provide simultaneous confidence intervals for the ranks. We argue that existing improvements of Tukey’s HSD cannot be used for the purpose of producing confidence intervals for the ranks. In Sections 5 and 6, we introduce a novel improvement of Tukey’s HSD which can produce simultaneous CIs for ranks. Our new rankability measure is presented in Section 7. Section 8 is devoted to simulation studies showing the failure of the bootstrap method to give a correct simultaneous coverage, and for comparing Tukey’s HSD to its new variant. An example of ranking Dutch hospitals is also discussed. Software for the methods presented in this paper is available in package ICRanks downloadable from CRAN.

2 Context and Objective

Let be real valued numbers which represent for example the true performance of the institutions we want to rank. Let be a sample of independent random variables drawn from Gaussian distributions in the following manner


where the standard deviations are known whereas the centers are unknown. The sample represents the observed performance indicators. Denote the true ranks of the centers respectively which are the target of inference. Our objective is to build simultaneous CIs for these ranks. Let us first define the ranks , allowing for the possibility of ties.

Definition 2.1 (ranks).

We define the lower-rank of center by


We also define the upper-rank of center by


We finally define the set-rank of as the set of natural numbers denoted here .

When the centers are all different, the ranks are calculated by counting down how many centers are below the current center and . When there are ties between the centers, we suppose that each of the tied centers possesses a set of ranks . For example, assume that we only have 3 centers and such that . Then, the rank of is the set and the rank of is also the set , whereas the rank of is the singleton . The rationale of the definition of the set-ranks is that in case of ties, the ranking is arbitrary, and a small perturbation of the true performance may produce any rank in the set of ranks.
We call the ranks induced from the observed sample the empirical ranks. These ranks might be different from the true ranks of the centers, and since the sample is assumed to have a continuous distribution, the empirical ranks are all singletons.

We aim on the basis of the sample to construct simultaneous confidence intervals for the set-ranks of the centers. In other words, for each we search for a confidence interval such that:


for a prespecified confidence level . It is worth noting that the confidence intervals here are confidence intervals in , the set of natural numbers.
Two different types of statement can be obtained from the simultaneous CIs (2.4). First, for each center what are the possible ranks that it might take (which is our main objective). Second, since the confidence intervals for the ranks are simultaneous, we can deduce confidence sets for the best center(s), second best center(s), etc. These confidence sets have also a joint confidence level of at least . Indeed, in order to find the centers that can be the best, it suffices to see who are the centers whose rank CI starts at 1. In the same way, we can look at the centers whose rank CI includes rank 2 to obtain a confidence set of the centers ranked second best and so on.

3 Simultaneous Confidence Intervals for Ranks Using Bootstrap

Before we introduce our methods, we first argue why the bootstrap-based method of Zhang et al. (2014) does not provide a correct coverage when the centers are close to each other. The method of Zhang et al. (2014) is the first method in the literature that was proposed to build simultaneous confidence intervals for ranks. The method proceeds as follows. They use the bootstrap-based method introduced by Marshall and Spiegelhalter (1998) to produce pointwise CIs at level for the ranks for several values of in the interval . For this purpose, bootstrap samples are generated. These are then used again to estimate the joint probability that the empirical ranks (the ranks of ) are inside these pointwise CIs at levels . They choose such that the set of pointwise CIs has the smallest estimated joint coverage superior to . According to Zhang et al. (2014), as increases, we should obtain simultaneous CIs with a more accurate confidence level. The authors provide a lower bound for and advise the reader to choose a sufficiently large value.

However, the method does not have a solid theoretical assurance. Indeed, it was pointed out by several authors that bootstrap-based (pointwise) confidence intervals for ranks do not have the correct coverage when there are ties or near ties among the centers (Xie et al., 2009; Hall and Miller, 2009). Even though this concerns the bootstrap method of Marshall and Spiegelhalter (1998) on which the method of Zhang et al. (2014) is based, it should not be surprising that also the simultaneous CIs with bootstrap of Zhang et al. (2014) fail to have the correct joint confidence level. In paragraph 8.1, we show through simulations the the simultaneous CIs calculated using bootstrap do not have the correct joint confidence level unless the centers are very far from each other. In the case that all centers are almost equal and the observations have a standard error of 1, the coverage is found to be instead of when considering 10 centers.
In applications such as comparing institutions, the differences among the institutions tend to be small, thus the use of the simultaneous bootstrap-based CIs is risky because it tends to produce too short CIs with confidence level less than . Therefore, for inference on ranks, we advise against bootstrap-based methods.

4 Simultaneous Confidence Intervals for Ranks Using Tukey’s HSD

Tukey’s pairwise comparison procedure (Tukey, 1953) best known as the Honest Significant Difference test (HSD) is an easy way to compare means of observations with (assumed) Gaussian distributions especially in ANOVA models. The interesting point about the procedure is that it provides simultaneous confidence statements about the differences between the means and controls the FWER at level . Moreover, it possesses certain optimality properties. In balanced one-way designs (which corresponds in our context to the situation that all ’s are equal), simultaneous confidence intervals for the differences have confidence level exactly . The method is also optimal in the sense that it produces the shortest confidence intervals for all pairwise differences among all procedures that give equal-width confidence intervals at joint level at least , see for example Hochberg and Tamhane (1987, p. 81) and Rafter et al. (2002).

The method.

We consider the general case with possibly unequal ’s here. Tukey’s HSD tests all null hypotheses at level using the rejection region


where is the quantile of order of the distribution of the Studentized range


and are independent centered Gaussian random variables with standard deviations respectively.
In practice, a simple way to construct the confidence intervals for the ranks is to start by sorting the observations . In order to calculate the CI for , it suffices to count down how many centers are not significantly different from it. The lower bound of the rank of is thus obtained by counting the number of times the hypothesis for is not rejected, say , or equivalently the number of times the test statistic is below the Studentized range quantile. We then count down how many times the hypothesis for is not rejected, say . The confidence interval for the rank of is then .

Proposition 4.1.

Tukey’s procedure produces simultaneous confidence intervals for the ranks of centers with joint confidence level .

The proof is in Appendix A.1. Suppose we have three institutions and with centers respectively. Assume that we found the following confidence intervals for the differences from Tukey’s HSD (rounded to 1 digit)

Then center gets a confidence interval for its rank , center gets a confidence interval for its rank and center gets a confidence interval for its rank .

Several step-down improvements on Tukey’s HSD have been proposed; the most efficient and well-known is the REGWQ (Rafter et al., 2002). Instead of testing equality of (ordered) pairs of centers, the procedure tests blocks of equality of centers. Step-down variants of Tukey’s HSD control the FWER at level , but do not provide any directional information about the relative position of the centers (no protection against type III errors), so that no information about the ranks can be derived. Step-down Tukey has not been proven to protect against type III error (Welsch, 1977) although some authors believe that it does. Therefore, we decided not to consider this approach to build CIs for ranks and worked on an alternative for which we can prove type III error control.

5 The Sequential Rejection Principle

To improve Tukey’s HSD, we will make use of the sequential rejection principle which we briefly review here in our special case. The sequential rejection principle, introduced by Goeman and Solari (2010), gives a general and intuitive way of building a sequential (iterative) algorithm for multiple testing based on a single-step method. In the context of this paper, we can keep things simpler than the general context of the sequential principle. We suppose that we have a statistical model with the multivariate normal model with known diagonal matrix. Consider the null hypotheses for . A hypothesis will be represented only by the pair and comes first to state that is the smaller center under . Denote the set of all null hypotheses (pairs) to be tested as . Note that pairs here are no longer ordered as before. A set of hypotheses is a set of pairs corresponding to the indexes of centers in the hypotheses included in it.
Depending on (the true vector of means), some of the hypotheses of interest are true and we write them , and the remaining are false denoted by . Define as the set of rejected hypotheses after iteration using the sequential algorithm. At iteration , suppose that the sequential algorithm rejects a hypothesis whenever

The critical value may depend on , but here it will not. The sequential algorithm is built so that it verifies two conditions in order to control the FWER at level . The critical value must verify the monotonicity condition which requires that as more hypotheses are rejected, the critical values never increase. This is equivalent to the requirement that for every , we have


This implies that the critical values must decrease as we progress in the sequential algorithm, that is


This condition is very natural because as we progress in the sequential algorithm, we should continue to reject more and more hypotheses. Since the remaining amount of hypotheses to be rejected becomes smaller, and we have fewer things to control for, the critical value needs only to adjust for the remaining relatively smaller set of hypotheses. The second condition applies on the rejection procedure used at each iteration. We must ensure that


In other words, if all false hypotheses are rejected, the probability that we reject a true hypothesis is less than . The proof of the following lemma is an adaptation of Theorem 1 in Goeman and Solari (2010). Denote as the set of rejected hypotheses at convergence.

Lemma 5.1.

If a sequential algorithm verifies both conditions (5.1) and (5.3), then the probability that at the final step of the sequential-rejective algorithm all rejected hypotheses are false ones exceeds

In the next section, we will use the sequential rejection principle in order to build a sequential-rejective version of Tukey’s procedure.

6 A Sequential-Rejective Variant of Tukey’s HSD

Assume that at iteration , we rejected the set of pairs with . Define the critical value function which is used to test the hypotheses at iteration provided that we have already rejected the set of pairs at iteration as the quantile of the maximum


where is a centered Gaussian random variable with variance equal to . Clearly, our critical value function is independent of the tested pair . It depends only on , the set of previously rejected pairs at iteration .
In the first iteration of our algorithm, no rejection has not been made yet and . We test the set of all pairs using the rejection region


Note that is equal to the quantile of the Studentized range calculated in Tukey’s HSD (4.2). Denote as the set of rejected pairs after the first iteration. In the second iteration, the unrejected pairs in the first iteration are tested against the quantile . The procedure continues until no further rejections are spotted. The remaining unrejected pairs at the final step are used to build the confidence intervals for the ranks of the centers.

0:  Ordered sample and corresponding standard deviations .
  Result: For each such that .
  Simulate -samples with from the Gaussian distributions
  For each sample, calculate the standardized maximum difference between the pairs (4.2)
  Calculate the quantile denoted
  PosPairs = matrix
  NegPairs = matrix
  while No more rejections are made do
     Use the critical value to test all observed pairs with indexes from PosPairs
     Update PosPairs with the set of indexes of unrejected pairs 
     // Update the critical value 
     Use the again to calculate the standardized maximum difference between the pairs (6.1) with PosPairs NegPairs.
     That is, calculate the vector of values
     and calculate the quantile numerically 
     Update with the new quantile. 
  end while
  Calculate the confidence intervals for the ranks using PosPairs
Algorithm 1 A sequential rejective variant of Tukey’s HSD

In practice (see Algorithm 1), we start by ordering the observed values. Then, we have two sets of pairs of indexes; positive indexes correspond to pairs with , and negative indexes correspond to pairs with . All negative pairs are automatically not rejected because the test statistic is negative whereas the critical value is positive. Thus, it suffices to test positive pairs. The critical value is calculated at iteration based on the negative pairs and the remaining unrejected positive pairs at iteration . As soon as the algorithm converges and no further rejections are obtained, the confidence intervals for the ranks are calculated based on the unrejected positive indexes.


Hypotheses corresponding to are never rejected, and their importance in the testing procedure is to ensure that the empirical rankings is actually in the confidence intervals. This is in fact what is missing in the step-down Tukey and which prevents it from telling the ordering of the groups of the centers.
Although the number of iterations is bounded by the number of positive pairs , in practice the algorithm converges generally within 3 or 4 iterations so that the complexity of the algorithm is of order . The worst case scenario which might never happen would result in a complexity of order .
We prove next that our sequential-rejective algorithm controls the FWER at level and produces simultaneous confidence intervals for the ranks of order . The proof is in Appendix A.2.

Lemma 6.1.

Algorithm 1 fulfills the sequential rejection principle and verifies the two conditions (5.1,5.3), and thus controls the FWER at level .

In practice (see Algorithm 1), the quantile of the maximum (A.2) is calculated numerically by generating N samples and calculating the maximum inside each one of them by considering only the indexes of unrejected pairs. Then calculate the quantile of the resulting vector of maxima. In order to keep the decrease of the critical value along the steps and ensure that the monotonicity condition (5.1) holds, the N samples must be generated once and for all so that they are used in each step to calculate the critical values.

Proposition 6.1.

The sequential-rejective algorithm (Algorithm 1) produces simultaneous confidence intervals for the ranks of centers at level .

The proof of this result is in Appendix A.3.We end this Section by a final remark concerning our variant of Tukey’s HSD. The sequential-rejective algorithm 1 produces uniformly shorter confidence intervals for the ranks than Tukey’s HSD. Indeed, both Tukey’s HSD and our sequential-rejective algorithm start by testing against the same critical value, that is the quantile of the Studentized range (4.2). Then, our sequential algorithm tries in further steps to do more rejections by testing again the unrejected pairs against a lower critical value than the quantile of the Studentized range (4.2).

7 A Rankability Measure

It is useful to have a single measure that gives an impression how well we can distinguish different centers, that is how rankable they are. A set of equal centers is evidently not rankable. Therefore, this set of centers should get a rankability of 0. On the other hand, a set of totally different centers should get a rankability of 1 (or ) since we can rank each center. As the ranks are observed through quantities provided with uncertainty, an estimate of the ”true” rankability should be considered along with a confidence interval. We will first define the estimand before we define the estimate and its CI.
Assume we have centers . Some of these centers might be equal. According to our definition of ranks, equal (or tied) centers all get a set of ranks which is the same for all of them. Define the rankability by

The normalization by is necessary for the rankability to take values in the interval . The sum gives the surface of the set-ranks (the light grey area in figure (1)) and the subtraction from one ensures that if the set-ranks cover the whole range of ranks, we conclude that the centers are not rankable and we say then that the set of centers have a rankability of 0. In figure (1), the true rankability is . The surface of the region in light grey (normalized by ) in figure (1) can be interpreted as the probability that two centers and picked at random have the same rank. Therefore, our rankability measure can be interpreted as the probability that two centers picked at random get different ranks.

The rankability , since it is defined through the true set-ranks, is a parameter that may be estimated. Denote the confidence interval for the set-rank of . We assume that these CIs have joint confidence level of , that is


Define the estimated rankability at level by

Due to inequality (7.1), the estimated rankability at level underestimates the true rankability with a probability at least . In other words

Since , the interval becomes a confidence interval for .
In figure (1), we show the simultaneous CIs for ranks calculated using Tukey’s HSD on a sample of 20 centers resulting in a CI for the which is . underestimates the true rankability with probability at least , and it thus overestimates it with probability at most as well which makes from a good candidate for a conservative point estimate of . We also show in figure (1) the simultaneous CIs produced by Tukey’s HSD, and the resulting CI for is then .

Figure 1: Underestimating the Rankability . A simulated example showing and simultaneous CIs for the ranks of a set of centers forming three distinct blocks. The (normalized) surface of the light grey blocks is equal to . The normalized surface of the ( resp.) simultaneous CIs gives an underestimation of with a probability ( resp.).

It is worth mentioning that the estimated rankability can also be seen as a performance (or error) index so that several methods providing simultaneous CIs can be compared based on their estimated rankability.

In the context of empirical Bayesian methods for estimating ranks, a rankability measure was proposed by Houwelingen et al. (1999). It indicates which part of variation between hospitals is due to true difference and which part is due to chance. Rankability is then computed by relating heterogeneity between the centers to uncertainty between and within the centers, see also Lingsma et al. (2009) and Henneman et al. (2014) among others. This measure is specific to the Bayesian method and cannot be used in our case, however our rankability measure is only related to the confidence intervals regardless of the method which produce them. The only requirement is that the confidence intervals are simultaneous.

8 Simulation Study and Real Data Analysis

In this section we provide several examples (real and simulated) to demonstrate the confidence intervals produced using our approaches from Sections 4 and 6. We also illustrate the inability of the method proposed by Zhang et al. (2014) to produce valid simultaneous confidence intervals for the ranks when the centers are relatively close to each other. Another simulated example with 50 centers shows how the sequential-rejective algorithm improves upon Tukey’s HSD when the distance among the centers is large enough with respect to the standard error.
Finally, we consider a dataset for patients with abdominal aneurysms from 64 hospitals in the Netherlands. We compare these hospitals according to the mortality rate at 30 days, and demonstrate that we are not able to detect any interesting differences among the hospitals. This might be because they are not truly different or because of lack of power in the data. We therefore used the type of surgery operated on the patient as an output measure which led to some clear differences among the hospitals. Conclusions about each of these experiments are given in each paragraph and some final remarks are given in the discussion section afterwards. All simulations and data analysis are done using the statistical program R Core Team (2017), and the code of the functions is available in the R package ICRanks which can be downloaded from the CRAN repository.

8.1 The simultaneous coverage of the bootstrap method

In order to investigate the simultaneous coverage of Zhang et al. (2014)’s method, we simulate data from 10 centers in four different situations with the same standard deviation () for all of them. We increase the range of the centers from to to illustrate situations with very close centers and others with very different ones. We added in Appendix B the R code we used and the values of the centers in each situation so that the resulting table of coverage (1) can be reproduced.
For each case, observations are generated from the Gaussian distributions for . For 10 centers and a joint confidence level of , the lower bound for number of bootstrap samples recommended by Zhang et al. (2014) is . In the website where the authors in Zhang et al. (2014) implement their method and illustrate it on several datasets on the mortality rate in the US, the default value of is . We therefore use this value of , and illustrate the coverage of the method in the above-described four different cases of centers. In order to calculate the coverage, we simulate 100 Gaussian samples in each of the situations and check if all the resulting confidence intervals contain their respective true center. The coverage is then calculated as the average of the number of times the confidence intervals cover simultaneously the true ranks which is an estimator of the true simultaneous confidence level that the method provides. We also indicate the coverage of our methods; Tukey’s HSD and the sequential-rejective algorithm.

Zhang et al. () 37 54 84 90
Tukey’s HSD 100 99 100 100
Sequential Tukey 100 99 100 100
Table 1: The coverage of the three methods producing simultaneous CIs calculated based on 100 simulated 10-samples. The confidence level is set to .

The coverage of the method of Zhang et al. (2014) is almost equal to the true confidence level only when the centers are very far from each other. When the centers are close to each other, the method fails to provide a good coverage and breaks down completely when the centers are equal. This is in line with the theoretical and simulated results shown by Xie et al. (2009) and Hall and Miller (2009) for the case of individual CIs for ranks using bootstrap. Because we are interested in ranking institutions where the differences among the hospitals are generally small, the bootstrap method should not be employed because there is no guarantee that it provides a correct simultaneous confidence level. Thus, it will not be included in further comparisons.
Note also that the coverage of our methods is quite high. We therefore believe that there is still a possibility to improve these methods and shorten the resulting confidence intervals.

8.2 A comparison between the CIs produced by Tukey’s HSD and its sequential variant

We generate a dataset of 50 centers from the Gaussian distributions for . The standard deviations were generated uniformly in the interval . We plot in figure (2) the confidence intervals for the ranks at joint level . The confidence intervals are represented by colored regions instead of bars in order to facilitate the comparison between the two methods. The rankability was for the sequential-rejective algorithm and for Tukey’s HSD.
The sequential variant of Tukey’s HSD provides several improvements on either the lower or the upper bounds of the centers. The improvement on the rankability is not very clear because the improvements were only 1 rank shorter for the CIs.

Figure 2: Simulation Example. A comparison between Tukey’s HSD and our sequential-rejective algorithm (Algorithm 1) on a simulated dataset showing the improvement that the latter method provides over the former. The number of centers is 50 and the observed values are generated from Gaussian distributions with standard deviation equal to 1. The produced CIs have simultaneous confidence level of .

8.3 Hospitals in the Netherlands - DSAA dataset

We study a dataset for Dutch hospitals concerning abdominal aneurysms surgery. The study includes 9489 patients operated at 64 hospitals in the Netherlands at dates mostly between the years 2012 and 2016. The number of patients per hospital ranged from 3 to 358 with an average of 150 patients per hospital. The dataset includes the following variables

  • the hospital ID where the patient was treated;

  • the date of surgery;

  • the context of surgery: Elective, Urgent, Emergency;

  • the surgical procedure: ”Endovascular”, ”Endovascular converted” and ”Open”. ”Endovascular” means the patient had a minimal invasive procedure through the femoral artery in the groin. ”Endovasculair converted” means the surgeons first tried a minimal invasive procedure through the femoral artery in the groin, but then realized they had to do an open surgery;

  • a complication within 30 days (yes or no);

  • the mortality within 30 days (yes or no);

  • VpPOSSUM: a numerical score that summarizes the pre-operative state of the patient.

In order to conform to the normality assumption in our model, we excluded hospitals with small number of patients. This left us with 61 hospitals and each one of them has at least 54 patients. We compare these hospitals according to the mortality rate within 30 days. We correct for case-mix effect with a fixed effect logistic regression model using the VpPOSSUM variable. One of the hospitals has no patients who died within 30 days after surgery. Thus, we added to all the hospitals a row of data with a virtual patient who died within 30 days after surgery and with a value of VpPOSSUM equals to the average in the corresponding hospital. This prevents the logistic regression from getting an infinite standard error for this hospital. Besides, the influence on the other hospitals is rather minor because of the relatively high number of patients in them. The simultaneous CIs at joint level are illustrated in figure (3).

Figure 3: Simultaneous confidence intervals for the ranks of 61 hospitals in the Netherlands with joint level . The performance indicator is the mortality rate, and the hospital effect is corrected for case-mix using a logistic regression. The hospitals are not distinguishable using the mortality rate. Tukey and its sequential variant gave identical results.

The confidence intervals cover the whole range of ranks, and there are barely any differences among the hospitals according to the mortality rate. The rankability is about for both methods; Tukey’s HSD and our variant. This can either be normal, that is all Dutch hospitals have the same performance, or due to a low power of our methods. In order to find out, we change the output variable in the logistic regression model and correct for case-mix effects with the type of surgery as an output. We make a forest plot for the hospital effect after case-mix correction for both the mortality withing 30 days and the surgical procedure. Figure (4) shows that indeed the mortality rate induces very few differences among the hospitals whereas the type of surgery seems to show more differences.

Figure 4: Forest plots for the hospitals effect after case-mix correction based on the mortality rate or the surgical procedure. The hospital effects based on the mortality rate are less distinguishable (have wider CIs) than the ones based on the surgical procedure.

The resulting CIs at joint level for the ranks are illustrated in figure (5) with a rankability of 0.149 for the sequential-rejective algorithm and 0.147 for Tukey’s HSD. The sequential-rejective algorithm improved Tukey’s HSD CIs in 5 intervals (by one rank).
The spotted differences in the forest plot among the hospitals were actually reflected in the CIs for the ranks and we are able to detect differences among the hospitals. We are also able to state that for only 17 hospitals we find that they may get the first rank, and that for the remaining 44 hospitals we can confidently state they are not first rank.

Figure 5: Simultaneous confidence intervals for the ranks of 60 hospitals in the Netherlands. Data is corrected for case-mix effect.

9 Discussion

We have presented two novel methods to produce simultaneous CIs for ranks with application to ranking quality of care at hospitals. Simultaneity is important when one is interested in identifying the best hospital, or the top 5, or the worst 10. Such overall results cannot be obtained from pointwise intervals.

Our approach is based on Tukey’s HSD and its improvement by the sequential rejection principle. The sequential-rejective variant provides a modest but uniform improvement over the method based on Tukey’s HSD. The gain becomes greater when the differences among the centers become larger.

The only method in the literature (as far as we know) claiming to produce simultaneous confidence intervals for ranks is (Zhang et al., 2014). However, theory indicates that the bootstrap does not provide proper coverage (Hall and Miller (2009); Xie et al. (2009)). We have performed a simulation that confirms that the bootstrap is severely anti-conservative.

We have compared the performance of 64 Dutch hospitals on mortality rates at 30 days after surgery. The simultaneous CI for ranks turned out to be all-inclusive which means that there is insufficient evidence for ranking on 30-day mortality. However, we have also compared the hospital on their preference for one of two types of surgery and then some differences among the hospitals became quite clear, especially the extremities. For example, were able for identify 17 hospitals for first rank.

Appendix A Proofs of Propositions

a.1 Proof of proposition 4.1

From Tukey’s procedure, we can obtain simultaneous confidence intervals for the differences between the centers at level , that is for , see Hochberg and Tamhane (1987, sec. 2.1). In other words, we have


Denote the confidence interval for the difference in the previous display. Define also and . Let . It is easy to see that the event implies the event for any . Thus using inequality (A.1), we may write

Hence, the confidence intervals for the set-ranks have a joint level of at least .

a.2 Proof of Lemma 6.1

The first condition (5.1) is immediately fulfilled. Indeed, if , then the quantile based on uses more pairs than the quantile based on since the maximum is calculated based on the set of unrejected pairs which is smaller for than it is for . In what concerns the second condition, it is fulfilled using similar arguments to those used in the proof that Tukey’s HSD produces a simultaneous statement, see Hochberg and Tamhane (1987, para. In condition (5.3), we need to make sure that after having rejected all false null hypotheses, the probability that we reject a true one never exceeds . Suppose that all false null hypotheses are rejected. The critical value is calculated using the remaining hypotheses (pairs). Let be the indexes of the remaining unrejected pairs. Denote the quantile of order of the maximum calculated based on a sample of pairs of centered Gaussian random variables for which the standard deviations correspond of course to the same pairs in , that is


The probability that we falsely reject any new pair from verifies

The third line comes of course from the definition of as the quantile of a maximum of Gaussian random variables (A.2). Thus, condition (5.3) is fulfilled and by the sequential rejection principle, the FWER is controlled at level . This ends the proof.

a.3 Proof of Proposition 6.1

According to lemma 5.1, the final (at convergence) set of rejected pairs by the algorithm, say , is true with probability at least . Consider center . If the hypothesis is not true, then the true lower-rank of which is is at least . Therefore, if is true, then due to (6.3) we have

On the other hand, if the hypothesis is not true, then the true upper-rank of which is is at most . Therefore, if is true, then due to (6.4) we have

Finally, if is true, then the events for all are also true. In other words,

Appendix B R Code for the Calculus of the Coverage of Zhang et al.’s Algorithm

b.1 Calculating the coverage

# TrueCenters = c(0.017, 0.020, 0.023, 0.029, 0.036, 0.039, 0.048, 0.077, 0.086, 0.089)
#TrueCenters = c(0.003, 0.242, 0.444, 0.457, 0.682, 0.691, 0.786, 0.866, 0.920, 0.953)
#TrueCenters = c(0.189, 0.828, 1.969, 1.996, 2.048, 2.184, 2.253, 5.268, 5.739, 6.201)
TrueCenters = c(1.512, 1.764, 1.853, 3.020, 3.154, 4.895, 5.419, 7.468, 10.521, 13.054)

# Take a subset and generate the data
alpha = 0.05; sigma = rep(1,n)

K = 10^4
coverage = 100
coverageTuk = 100
coverageSeqTuk = 100
for(i in 1:100)
#set.seed(i*37833) # For the first case
#set.seed(i*37835) # For the second case
#set.seed(i*37837) # For the third case
set.seed(i*37831) # For the fourth case
y = as.numeric(sapply(1:n, function(ll) rnorm(1,TrueCenters[ll],sd=sigma[ll])))
ind =, index.return = T)$ix
y = y[ind]
resZhang = BootstrCIs(y, sigma, alpha = 0.05, N = K, K = K, maxiter = 10)
resTukey = ic.ranks(y, sigma, Method = "Tukey", alpha = 0.05)
resTukeySeq = ic.ranks(y, sigma, Method = "SeqTukey", alpha = 0.05)

if(sum(ind<resZhang$Lower | ind>resZhang$Upper)>0)
Ψcoverage = coverage - 1
if(sum(ind<resTukey$Lower | ind>resTukey$Upper)>0)
ΨcoverageTuk = coverageTuk - 1
if(sum(ind<resTukeySeq$Lower | ind>resTukeySeq$Upper)>0)
ΨcoverageSeqTuk = coverageSeqTuk - 1

b.2 An R function to calculate the CIs for the ranks according to Zhang et al.’s method

BootstrCIs = function(y, sigma, alpha=0.05, N = 10^4, K = N, precision = 1e-6,
ΨΨΨΨΨΨΨmaxiter = 50)
# A function which calculates the individual CIs at level beta
Spiegelhalter = function(mus,ses,beta, N = 10^4)
#x=ses*matrix(rnorm(N*k),nrow=k) + mus
n = length(y)
beta1 = 0; beta2 = alpha
beta = (beta2 + beta1) / 2
x=sigma*matrix(rnorm(K*n),nrow=n) + y
InitCIs = Spiegelhalter(y, sigma, alpha, N)
counter = 0; coverage = K
while(abs(beta1 - beta2)>precision | counter<=maxiter)

 # Generate individual CIs at level beta
 res = Spiegelhalter(y, sigma, beta, N)
 # Check the coverage
 coverage = K
 for(j in 1:K)
Ψind = rank(x[,j])
Ψif(sum(ind<res$lower | ind>res$upper)>0) coverage = coverage - 1
 if(coverage/K >= 1-alpha)
Ψbeta1 = beta
Ψbeta2 = beta
  beta = (beta2 + beta1) / 2

 counter = counter + 1
# print(counter)
if(coverage/K < 1-alpha) beta = beta1
res = Spiegelhalter(y, sigma, beta, N)
return(list(Lower = res$lower, Upper = res$upper, coverage = coverage/K))


  • Bie [2013] Ting Bie. Confidence intervals for ranks: Theory and applications in binomial data. Master’s thesis, Uppsala University, Sweden, 2013. Master thesis under the supervision of R. Larsson.
  • Feudtner et al. [2011] Chris Feudtner, Jay G. Berry, Gareth Parry, Paul Hain, Rustin B. Morse, Anthony D. Slonim, Samir S. Shah, and Matt Hall. Statistical uncertainty of mortality rates and rankings for children’s hospitals. Pediatrics, 128(4):e966–e972, 2011. doi: 10.1542/peds.2010-3074.
  • Gerzoff and Williamson [2001] Robert B. Gerzoff and G. David Williamson. Who’s number one? the impact of variability on rankings based on public health indicators. Public Health Reports (1974-), 116(2):158–164, 2001.
  • Goeman and Solari [2010] Jelle J. Goeman and Aldo Solari. The sequential rejection principle of familywise error control. Ann. Statist., 38(6):3782–3810, 12 2010.
  • Goldstein and Spiegelhalter [1996] Harvey Goldstein and David J. Spiegelhalter. League tables and their limitations: Statistical issues in comparisons of institutional performance. Journal of the Royal Statistical Society. Series A (Statistics in Society), 159(3):385–443, 1996.
  • Hall and Miller [2009] Peter Hall and Hugh Miller. Using the bootstrap to quantify the authority of an empirical ranking. Ann. Statist., 37(6B):3929–3959, 12 2009.
  • Henneman et al. [2014] Daniel Henneman, Annelotte C. M. van Bommel, Alexander Snijders, Heleen S. Snijders, Rob A. E. M. Tollenaar, Michel W. J. M. Wouters, and Marta Fiocco. Ranking and rankability of hospital postoperative mortality rates in colorectal cancer surgery. Annals of Surgery, pages 844–849, 2014.
  • Hochberg and Tamhane [1987] Y. Hochberg and A.C. Tamhane. Multiple comparison procedures. Wiley series in probability and mathematical statistics: Applied probability and statistics. Wiley, 1987.
  • Holm [2012] S. Holm. Confidence intervals for ranks. Department of Mathematical Statistics, 2012. Unpublished manucript.
  • Houwelingen et al. [1999] Hans C. van Houwelingen, Ronald Brand, and Thomas A. Louis. Empirical bayes methods for monitoring health care quality. 1999. unpublished manuscript.
  • Laird and Louis [1989] Nan M. Laird and Thomas A. Louis. Empirical bayes ranking methods. Journal of Educational Statistics, 14(1):29–46, 1989.
  • Lemmers et al. [2007] Oscar Lemmers, Jan A.M.Kremer, and George F.Borm. Incorporating natural variation into IVF clinic league tables. Human Reproduction, 22(5):1359–1362, 2007.
  • Lemmers et al. [2009] Oscar Lemmers, Mireille Broeders, André Verbeek, Gerard Den Heeten, Roland Holland, and George F Borm. League tables of breast cancer screening units: Worst-case and best-case scenario ratings helped in exposing real differences between performance ratings. Journal of Medical Screening, 16(2):67–72, 2009. PMID: 19564518.
  • Lin et al. [2006] Rongheng Lin, Thomas A. Louis, Susan M. Paddock, and Greg Ridgeway. Loss function based ranking in two-stage, hierarchical models. Bayesian Anal., 1(4):915–946, 12 2006.
  • Lin et al. [2009] Rongheng Lin, Thomas A. Louis, Susan M. Paddock, and Greg Ridgeway. Ranking usrds provider specific smrs from 1998–2001. Health Services and Outcomes Research Methodology, 9(1):22–38, 2009.
  • Lingsma et al. [2009] Hester F. Lingsma, Marinus JC Eijkemans, and Ewout W. Steyerberg. Incorporating natural variation into ivf clinic league tables: The expected rank. BMC Medical Research Methodology, 9(1):53, 2009.
  • Marshall and Spiegelhalter [1998] E. Clare Marshall and David J. Spiegelhalter. Reliability of league tables of in vitro fertilisation clinics: retrospective analysis of live birth rates. BMJ : British Medical Journal, 316:1701–1705, 1998.
  • Moss et al. [2017] Jennifer L. Moss, Benmei Liu, and Li Zhu. Comparing percentages and ranks of adolescent weight-related outcomes among u.s. states: Implications for intervention development. Preventive Medicine, 105:109 – 115, 2017.
  • Noma et al. [2010] Hisashi Noma, Shigeyuki Matsui, Takashi Omori, and Tosiya Sato. Bayesian ranking and selection methods using hierarchical mixture models in microarray studies. Biostatistics, 11(2):281, 2010.
  • R Core Team [2017] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2017. URL
  • Rafter et al. [2002] John A. Rafter, Martha L. Abell, and James P. Braselton. Multiple comparison methods for means. SIAM Review, 44(2):259–278, 2002.
  • Spiegelhalter [2005] David J. Spiegelhalter. Funnel plots for comparing institutional performance. Statistics in Medicine, 24(8):1185–1202, 2005.
  • Tekkis et al. [2003] Paris P Tekkis, Peter McCulloch, Adrian C Steger, Irving S Benjamin, and Jan D Poloniecki. Mortality control charts for comparing performance of surgical units: validation study using hospital mortality data. BMJ, 326(7393):786, 2003.
  • Tukey [1953] J. W. Tukey. The problem of multiple comparisons. The Collected Works of John W. Tukey VIII. Multiple Comparisons: 1948 - 1983, pages 1–300, 1953. Unpublished manuscript.
  • Waldrop et al. [2017] Anne R. Waldrop, Jennifer L. Moss, Benmei Liu, and Li Zhu. Ranking states on coverage of cancer-preventing vaccines among adolescents: The influence of imprecision. Public Health Reports, page 0033354917727274, 2017. PMID: 28854349.
  • Welsch [1977] Roy E. Welsch. Stepwise multiple comparison procedures. Journal of the American Statistical Association, 72(359):566–575, 1977.
  • Xie et al. [2009] Minge Xie, Kesar Singh, and Cun-Hui Zhang. Confidence intervals for population ranks in the presence of ties and near ties. Journal of the American Statistical Association, 104(486):775–788, 2009.
  • Zhang et al. [2014] Shunpu Zhang, Jun Luo, Li Zhu, David G. Stinchcomb, Dave Campbell, Ginger Carter, Scott Gilkeson, and Eric J. Feuer. Confidence intervals for ranks of age-adjusted rates across states or counties. Statistics in Medicine, 33(11):1853–1866, 2014.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description