Simultaneous Confidence Intervals for Ranks With Application to Ranking Institutions

When a ranking of institutions such as medical centers or universities is based on an indicator provided with a standard error, confidence intervals should be calculated to assess the quality of these ranks. We consider the problem of constructing simultaneous confidence intervals for the ranks of means based on an observed sample. For this aim, the only available method from the literature uses Monte-Carlo simulations and is highly anticonservative especially when the means are close to each other or have ties. We present a novel method based on Tukey's honest significant difference test (HSD). Our new method is on the contrary conservative when there are no ties. By properly rescaling these two methods to the nominal confidence level, they surprisingly perform very similarly. The Monte-Carlo method is however unscalable when the number of institutions is large than 30 to 50 and stays thus anticonservative. We provide extensive simulations to support our claims and the two methods are compared in terms of their simultaneous coverage and their efficiency. We provide a data analysis for 64 hospitals in the Netherlands and compare both methods. Software for our new methods is available online in package ICRanks downloadable from CRAN. Supplementary materials include supplementary R code for the simulations and proofs of the propositions presented in this paper.


Introduction
Estimation of ranks is an important statistical problem which appears in many applications in healthcare, education and social services [Goldstein and Spiegelhalter, 1996] to compare the performance of medical centers, universities or more generally institutions.Estimates of ranks have generally a great uncertainty so that confidence intervals (CIs) become crucial [Marshall andSpiegelhalter, 1998, Goldstein andSpiegelhalter, 1996].It is surprising that inference of ranks has received little attention in the statistical literature.In applications, ranks are rarely accompanied with CIs and if so these are generally pointwise.This paper presents a first method to produce simultaneous CIs at a prespecified joint level 1 − α for the ranks with correct coverage of the true ranks.Simultaneity is important in the context of ranking estimation whenever we are not interested in a specific named institution but rather in all the institutions together.Simultaneity is also necessary to quantify the uncertainty about which institutions are ranked best, second best, etc.
To the best of our knowledge, there is not yet any valid method which produces simultaneous confidence intervals for ranks.A method which claims to produce simultaneous CIs for ranks is based on bootstrapping and was introduced by Zhang et al. [2014].Pointwise CIs for ranks using bootstrap were introduced by Goldstein and Spiegelhalter [1996] and the method was adopted later in several papers such as Marshall and Spiegelhalter [1998], Gerzoff and Williamson [2001] and Feudtner et al. [2011] among others.However, it was pointed out by Hall and Miller [2009] and Xie et al. [2009] that the bootstrap pointwise CIs fail to cover the true ranks in the presence of ties or near ties among the compared institutions.Zhang et al. [2014] used these pointwise bootstrap CIs to produce simultaneous CIs for the ranks and their method was adopted later in some recent papers such as Waldrop et al. [2017] and Moss et al. [2017].We argue that the CIs for the ranks produced by the method of Zhang et al. [2014], like the method of Marshall and Spiegelhalter [1998], might cover simultaneously at level 1 − α only when the differences among the compared institutions are very large, which in most practical situations does not hold, see Section 3 for more details.Other methods in the literature include funnel plots, see Tekkis et al. [2003], Spiegelhalter [2005] among others.Methods based on empirical Bayes approaches were also considered, see Laird and Louis [1989], Houwelingen et al. [1999], Lin et al. [2006], Lin et al. [2009], Lingsma et al. [2009] and Noma et al. [2010] among others.These two approaches although have been considered in comparing institutions, they do not aim to build CIs for ranks.Testing pairwise differences between means was also used to produce pointwise CIs for ranks [Lemmers et al., 2007, 2009, Holm, 2012, Bie, 2013].Lemmers et al. [2007] tested pairwise differences among Dutch hospitals by calculating Z-scores for their performance indicators, but they did not correct for multiple testing and thus their CIs for ranks are not simultaneous.Holm [2012] (see also Bie [2013]) calculated also a Z-score, but he applied Holm's sequential algorithm to correct for multiple comparisons on the institution level, that is for each institution he corrects for comparisons with other instituions.Nevertheless, this is only sufficient if we are interested in one of the institutions, but it is not sufficient to produce simultaneous confidence intervals for the ranks of the institutions.
Our novel method uses Tukey's honest significant difference (HSD) test [Tukey, 1953] which controls the familywise error rate (FWER).We show that Tukey's HSD can be used to produce (valid) simultaneous confidence intervals for ranks.We also introduce a new iterative procedure based on Tukey's HSD and the sequential rejection principle [Goeman and Solari, 2010].The new algorithm produces simultaneous confidence intervals for the ranks uniformly shorter than those obtained using Tukey's procedure for the same joint confidence level.
We also introduce in this paper a new rankability measure defined as the proportion of pairs of institutions that have different true performances.We estimate the true rankability by our methods (Tukey's HSD or its variant), and provide a lower confidence bound for it.
The paper is organized as follows.In Section 2, we explain the context of this paper, the notations and the objective.In Section 3, we review the bootstrap method to produce CIs for ranks.In Section 4, we revisit Tukey's HSD and show that it can be used to provide simultaneous confidence intervals for the ranks.We argue that existing improvements of Tukey's HSD cannot be used for the purpose of producing confidence intervals for the ranks.In Sections 5 and 6, we introduce a novel improvement of Tukey's HSD which can produce simultaneous CIs for ranks.Our new rankability measure is presented in Section 7. Section 8 is devoted to simulation studies showing the failure of the bootstrap method to give a correct simultaneous coverage, and for comparing Tukey's HSD to its new variant.An example of ranking Dutch hospitals is also discussed.Software for the methods presented in this paper is available in package ICRanks downloadable from CRAN.

Context and Objective
Let µ 1 , • • • , µ n be n real valued numbers which represent for example the true performance of the institutions we want to rank.Let y = (y 1 , • • • , y n ) be a sample of n independent random variables drawn from Gaussian distributions in the following manner where the standard deviations σ 1 , • • • , σ n are known whereas the centers µ 1 , • • • , µ n are unknown.The sample represents the observed performance indicators.Denote r 1 , • • • , r n the true ranks of the centers respectively which are the target of inference.Our objective is to build simultaneous CIs for these ranks.Let us first define the ranks r 1 , • • • , r n , allowing for the possibility of ties.
Definition 2.1 (ranks).We define the lower-rank of center µ i by We also define the upper-rank of center µ i by We finally define the set-rank of µ i as the set of natural numbers When the centers are all different, the ranks are calculated by counting down how many centers are below the current center and r i = l i = u i .When there are ties between the centers, we suppose that each of the tied centers possesses a set of ranks r i = [l i , u i ].For example, assume that we only have 3 centers µ 1 , µ 2 and µ 3 such that µ 1 = µ 2 < µ 3 .Then, the rank of µ 1 is the set {1, 2} and the rank of µ 2 is also the set {1, 2}, whereas the rank of µ 3 is the singleton {3}.The rationale of the definition of the set-ranks is that in case of ties, the ranking is arbitrary, and a small perturbation of the true performance may produce any rank in the set of ranks.We call the ranks induced from the observed sample y the empirical ranks.These ranks might be different from the true ranks of the centers, and since the sample is assumed to have a continuous distribution, the empirical ranks are all singletons.
We aim on the basis of the sample y to construct simultaneous confidence intervals for the set-ranks of the centers.In other words, for each i we search for a confidence interval [L i , U i ] such that: for a prespecified confidence level 1 − α.It is worth noting that the confidence intervals here are confidence intervals in N, the set of natural numbers.Two different types of statement can be obtained from the simultaneous CIs (2.4).First, for each center what are the possible ranks that it might take (which is our main objective).Second, since the confidence intervals for the ranks are simultaneous, we can deduce confidence sets for the best center(s), second best center(s), etc.These confidence sets have also a joint confidence level of at least 1 − α.Indeed, in order to find the centers that can be the best, it suffices to see who are the centers whose rank CI starts at 1.In the same way, we can look at the centers whose rank CI includes rank 2 to obtain a confidence set of the centers ranked second best and so on.
3 Simultaneous Confidence Intervals for Ranks Using Bootstrap Before we introduce our methods, we first argue why the bootstrap-based method of Zhang et al. [2014] does not provide a correct coverage when the centers are close to each other.The method of Zhang et al. [2014] is the first method in the literature that was proposed to build simultaneous confidence intervals for ranks.The method proceeds as follows.They use the bootstrap-based method introduced by Marshall and Spiegelhalter [1998] to produce pointwise CIs at level 1 − β for the ranks for several values of β in the interval (0, α).For this purpose, K bootstrap n−samples are generated.These are then used again to estimate the joint probability that the empirical ranks (the ranks of y) are inside these pointwise CIs at levels 1 − β.They choose β such that the set of pointwise CIs has the smallest estimated joint coverage superior to 1 − α.According to Zhang et al. [2014], as K increases, we should obtain simultaneous CIs with a more accurate confidence level.The authors provide a lower bound for K and advise the reader to choose a sufficiently large value.
However, the method does not have a solid theoretical assurance.Indeed, it was pointed out by several authors that bootstrap-based (pointwise) confidence intervals for ranks do not have the correct coverage when there are ties or near ties among the centers [Xie et al., 2009, Hall andMiller, 2009].Even though this concerns the bootstrap method of Marshall and Spiegelhalter [1998] on which the method of Zhang et al. [2014] is based, it should not be surprising that also the simultaneous CIs with bootstrap of Zhang et al. [2014] fail to have the correct joint confidence level.In paragraph 8.1, we show through simulations the the simultaneous CIs calculated using bootstrap do not have the correct joint confidence level unless the centers are very far from each other.In the case that all centers are almost equal and the observations have a standard error of 1, the coverage is found to be 37% instead of 95% when considering 10 centers.In applications such as comparing institutions, the differences among the institutions tend to be small, thus the use of the simultaneous bootstrap-based CIs is risky because it tends to produce too short CIs with confidence level less than 1 − α.Therefore, for inference on ranks, we advise against bootstrap-based methods.

Simultaneous Confidence Intervals for Ranks Using Tukey's HSD
Tukey's pairwise comparison procedure [Tukey, 1953] best known as the Honest Significant Difference test (HSD) is an easy way to compare means of observations with (assumed) Gaussian distributions especially in ANOVA models.The interesting point about the procedure is that it provides simultaneous confidence statements about the differences between the means and controls the FWER at level α.Moreover, it possesses certain optimality properties.In balanced one-way designs (which corresponds in our context to the situation that all σ i 's are equal), simultaneous confidence intervals for the differences have confidence level exactly 1 − α.The method is also optimal in the sense that it produces the shortest confidence intervals for all pairwise differences among all procedures that give equal-width confidence intervals at joint level at least 1 − α, see for example Hochberg and Tamhane [1987, p. 81] and Rafter et al. [2002].
The method.We consider the general case with possibly unequal σ i 's here.Tukey's HSD tests all null hypotheses H i,j : µ i − µ j = 0 at level α using the rejection region where q 1−α is the quantile of order 1 − α of the distribution of the Studentized range and Ỹ1 , • • • , Ỹn are independent centered Gaussian random variables with standard deviations In practice, a simple way to construct the confidence intervals for the ranks is to start by sorting the observations y 1 < y 2 < • • • < y n .In order to calculate the CI for µ i , it suffices to count down how many centers are not significantly different from it.The lower bound of the rank of µ i is thus obtained by counting the number of times the hypothesis µ i = µ j for j < i is not rejected, say a i , or equivalently the number of times the test statistic is below the Studentized range quantile.We then count down how many times the hypothesis Proposition 4.1.Tukey's procedure produces simultaneous confidence intervals for the ranks of centers The proof is in Appendix A.1.Suppose we have three institutions A, B and C with centers µ A , µ B , µ C respectively.Assume that we found the following 95% confidence intervals for the differences from Tukey's HSD (rounded to 1 digit) Then center A gets a confidence interval for its rank [1, 1], center B gets a confidence interval for its rank [2, 3] and center C gets a confidence interval for its rank [2,3].
Several step-down improvements on Tukey's HSD have been proposed; the most efficient and well-known is the REGWQ [Rafter et al., 2002].Instead of testing equality of (ordered) pairs of centers, the procedure tests blocks of equality of centers.
Step-down variants of Tukey's HSD control the FWER at level α, but do not provide any directional information about the relative position of the centers (no protection against type III errors), so that no information about the ranks can be derived.Step-down Tukey has not been proven to protect against type III error [Welsch, 1977] although some authors believe that it does.Therefore, we decided not to consider this approach to build CIs for ranks and worked on an alternative for which we can prove type III error control.

The Sequential Rejection Principle
To improve Tukey's HSD, we will make use of the sequential rejection principle which we briefly review here in our special case.The sequential rejection principle, introduced by Goeman and Solari [2010], gives a general and intuitive way of building a sequential (iterative) algorithm for multiple testing based on a single-step method.In the context of this paper, we can keep things simpler than the general context of the sequential principle.We suppose that we have a statistical model (P µ ) µ∈R n with P the multivariate normal model with known diagonal matrix.Consider the null hypotheses H i,j : µ i ≤ µ j for i = j.A hypothesis H i,j : µ i ≤ µ j will be represented only by the pair (i, j) and i comes first to state that µ i is the smaller center under H i,j .Denote the set of all null hypotheses (pairs) to be tested as H = {(i, j), 1 ≤ i = j ≤ n}.Note that pairs here are no longer ordered as before.A set R of hypotheses is a set of pairs corresponding to the indexes of centers in the hypotheses H i,j included in it.Depending on µ (the true vector of means), some of the hypotheses of interest are true and we write them T (µ), and the remaining are false denoted by F(µ) = H \ T (µ).Define R k ⊂ H as the set of rejected hypotheses after iteration k using the sequential algorithm.At iteration k + 1, suppose that the sequential algorithm rejects a hypothesis H i,j whenever The critical value q 1−α (R k ) may depend on H i,j , but here it will not.The sequential algorithm is built so that it verifies two conditions in order to control the FWER at level α.The critical value q 1−α (R) must verify the monotonicity condition which requires that as more hypotheses are rejected, the critical values never increase.This is equivalent to the requirement that for every R ⊆ S ⊂ H, we have q 1−α (R) ≥ q 1−α (S). (5.1) This implies that the critical values must decrease as we progress in the sequential algorithm, that is This condition is very natural because as we progress in the sequential algorithm, we should continue to reject more and more hypotheses.Since the remaining amount of hypotheses to be rejected becomes smaller, and we have fewer things to control for, the critical value needs only to adjust for the remaining relatively smaller set of hypotheses.The second condition applies on the rejection procedure used at each iteration.We must ensure that In other words, if all false hypotheses are rejected, the probability that we reject a true hypothesis is less than α.The proof of the following lemma is an adaptation of Theorem 1 in Goeman and Solari [2010].Denote R final step as the set of rejected hypotheses at convergence.
Lemma 5.1.If a sequential algorithm verifies both conditions (5.1) and (5.3), then the probability that at the final step of the sequential-rejective algorithm all rejected hypotheses are false ones exceeds In the next section, we will use the sequential rejection principle in order to build a sequentialrejective version of Tukey's procedure.

A Sequential-Rejective Variant of Tukey's HSD
Assume that at iteration k, we rejected the set of pairs R k with R 0 = ∅.Define the critical value function q 1−α (R k ) which is used to test the hypotheses H i,j : µ i ≤ µ j at iteration k + 1 provided that we have already rejected the set of pairs R k at iteration k as the quantile of the maximum max where Ỹi is a centered Gaussian random variable with variance equal to σ 2 i .Clearly, our critical value function is independent of the tested pair H i,j .It depends only on R k , the set of previously rejected pairs at iteration k.
In the first iteration of our algorithm, no rejection has not been made yet and R 0 = ∅.We test the set of all pairs H using the rejection region Note that q 1−α (R 0 ) is equal to the quantile of the Studentized range calculated in Tukey's HSD (4.2).Denote R 1 as the set of rejected pairs after the first iteration.In the second iteration, the unrejected pairs in the first iteration H \ R 1 are tested against the quantile q 1−α (R 1 ).The procedure continues until no further rejections are spotted.The remaining unrejected pairs at the final step are used to build the confidence intervals for the ranks of the centers.
Algorithm 1 A sequential rejective variant of Tukey's HSD Require: Ordered sample y 1 , • • • , y n and corresponding standard deviations For each sample, calculate the standardized maximum difference between the pairs (4.2) Calculate the 1 − α quantile denoted q (0) q = q (0) PosPairs = matrix((i, j), i > j) NegPairs = matrix((i, j), i < j) while No more rejections are made do Use the critical value q to test all observed pairs with indexes from PosPairs Update PosPairs with the set of indexes of unrejected pairs // Update the critical value Use the x (1) , • • • , x (N ) again to calculate the standardized maximum difference between the pairs (6.1) with I × J = PosPairs ∪ NegPairs.
That is, calculate the vector of values and calculate the 1 − α quantile numerically Update q with the new 1 − α quantile.end while Calculate the confidence intervals for the ranks using PosPairs In practice (see Algorithm 1), we start by ordering the observed values.Then, we have two sets of pairs of indexes; positive indexes correspond to pairs (y i , y j ) with y i > y j , and negative indexes correspond to pairs (y i , y j ) with y i < y j .All negative pairs are automatically not rejected because the test statistic is negative whereas the critical value is positive.Thus, it suffices to test positive pairs.The critical value is calculated at iteration k based on the negative pairs and the remaining unrejected positive pairs at iteration k − 1.As soon as the algorithm converges and no further rejections are obtained, the confidence intervals for the ranks are calculated based on the unrejected positive indexes.
Hypotheses µ i ≤ µ j corresponding to y i < y j are never rejected, and their importance in the testing procedure is to ensure that the empirical rankings y 1 < y 2 < • • • < y n is actually in the confidence intervals.This is in fact what is missing in the step-down Tukey and which prevents it from telling the ordering of the groups of the centers.Although the number of iterations is bounded by the number of positive pairs n(n − 1)/2, in practice the algorithm converges generally within 3 or 4 iterations so that the complexity of the algorithm is of order O(n 2 ).The worst case scenario which might never happen would result in a complexity of order O(n 4 ).We prove next that our sequential-rejective algorithm controls the FWER at level α and produces simultaneous confidence intervals for the ranks of order 1 − α.The proof is in Appendix A.2.
Lemma 6.1.Algorithm 1 fulfills the sequential rejection principle and verifies the two conditions (5.1,5.3), and thus controls the FWER at level α.
In practice (see Algorithm 1), the quantile of the maximum (A.2) is calculated numerically by generating N samples and calculating the maximum inside each one of them by considering only the indexes of unrejected pairs.Then calculate the 95% quantile of the resulting vector of maxima.In order to keep the decrease of the critical value along the steps and ensure that the monotonicity condition (5.1) holds, the N samples must be generated once and for all so that they are used in each step to calculate the critical values.
Proposition 6.1.The sequential-rejective algorithm (Algorithm 1) produces simultaneous confidence intervals for the ranks of centers µ The proof of this result is in Appendix A.3.We end this Section by a final remark concerning our variant of Tukey's HSD.The sequential-rejective algorithm 1 produces uniformly shorter confidence intervals for the ranks than Tukey's HSD.Indeed, both Tukey's HSD and our sequential-rejective algorithm start by testing against the same critical value, that is the quantile of the Studentized range (4.2).Then, our sequential algorithm tries in further steps to do more rejections by testing again the unrejected pairs against a lower critical value than the quantile of the Studentized range (4.2).

A Rankability Measure
It is useful to have a single measure that gives an impression how well we can distinguish different centers, that is how rankable they are.A set of equal centers is evidently not rankable.Therefore, this set of centers should get a rankability of 0. On the other hand, a set of totally different centers should get a rankability of 1 (or 100%) since we can rank each center.As the ranks are observed through quantities provided with uncertainty, an estimate of the "true" rankability should be considered along with a confidence interval.We will first define the estimand before we define the estimate and its CI.Assume we have n centers µ 1 ≤ • • • ≤ µ n .Some of these centers might be equal.According to our definition of ranks, equal (or tied) centers all get a set of ranks [l i , u i ] which is the same for all of them.Define the rankability R n by The normalization by n(n − 1) is necessary for the rankability R n to take values in the interval [0, 1].The sum gives the surface of the set-ranks (the light grey area in figure (1)) and the subtraction from one ensures that if the set-ranks cover the whole range of ranks, we conclude that the centers are not rankable and we say then that the set of centers have a rankability of 0. In figure (1), the true rankability is R 20 = 0.616.The surface of the region in light grey (normalized by n(n − 1)) in figure (1) can be interpreted as the probability that two centers µ i and µ j picked at random have the same rank.Therefore, our rankability measure R n can be interpreted as the probability that two centers picked at random get different ranks.
The rankability R n , since it is defined through the true set-ranks, is a parameter that may be estimated.Denote [L i , U i ] the confidence interval for the set-rank of µ i .We assume that these CIs have joint confidence level of 1 − α, that is Define the estimated rankability at level 1 − α by Due to inequality (7.1), the estimated rankability at level 1 − α underestimates the true rankability with a probability at least 1 − α.In other words In figure (1), we show the 50% simultaneous CIs for ranks calculated using Tukey's HSD on a sample of 20 centers resulting in a 50% CI for the R n which is [0.232, 1].Rn (0.5) underestimates the true rankability R n with probability at least 50%, and it thus overestimates it with probability at most 50% as well which makes from Rn (0.5) a good candidate for a conservative point estimate of R n .We also show in figure (1) the 95% simultaneous CIs produced by Tukey's HSD, and the resulting 95% CI for R n is then [0.126, 1].It is worth mentioning that the estimated rankability can also be seen as a performance (or error) index so that several methods providing simultaneous CIs can be compared based on their estimated rankability.
In the context of empirical Bayesian methods for estimating ranks, a rankability measure was proposed by Houwelingen et al. [1999].It indicates which part of variation between hospitals is due to true difference and which part is due to chance.Rankability is then computed by relating heterogeneity between the centers to uncertainty between and within the centers, see also Lingsma et al. [2009] and Henneman et al. [2014] among others.This measure is specific to the Bayesian method and cannot be used in our case, however our rankability measure is only related to the confidence intervals regardless of the method which produce them.The only requirement is that the confidence intervals are simultaneous.

Simulation Study and Real Data Analysis
In this section we provide several examples (real and simulated) to demonstrate the confidence intervals produced using our approaches from Sections 4 and 6.We also illustrate the inability of the method proposed by Zhang et al. [2014] to produce valid simultaneous confidence intervals for the ranks when the centers are relatively close to each other.Another simulated example with 50 centers shows how the sequential-rejective algorithm improves upon Tukey's HSD when the distance among the centers is large enough with respect to the standard error.Finally, we consider a dataset for patients with abdominal aneurysms from 64 hospitals in the Netherlands.We compare these hospitals according to the mortality rate at 30 days, and demonstrate that we are not able to detect any interesting differences among the hospitals.This might be because they are not truly different or because of lack of power in the data.We therefore used the type of surgery operated on the patient as an output measure which led to some clear differences among the hospitals.Conclusions about each of these experiments are given in each paragraph and some final remarks are given in the discussion section afterwards.All simulations and data analysis are done using the statistical program R Core Team [2017], and the code of the functions is available in the R package ICRanks which can be downloaded from the CRAN repository.

The simultaneous coverage of the bootstrap method
In order to investigate the simultaneous coverage of Zhang et al. [2014]'s method, we simulate data from 10 centers in four different situations with the same standard deviation (σ = 1) for all of them.We increase the range of the centers from 0.1 to 15 to illustrate situations with very close centers and others with very different ones.We added in Appendix B the R code we used and the values of the centers in each situation so that the resulting table of coverage (1) can be reproduced.For each case, observations are generated from the Gaussian distributions N (µ i , 1) for i = 1, • • • , 10.For 10 centers and a joint confidence level of 95%, the lower bound for number of bootstrap samples recommended by Zhang et al. [2014] is K = 390.In the website https://surveillance.cancer.gov/cirankwhere the authors in Zhang et al. [2014] implement their method and illustrate it on several datasets on the mortality rate in the US, the default value of K is 10 4 .We therefore use this value of K, and illustrate the coverage of the method in the above-described four different cases of centers.In order to calculate the coverage, we simulate 100 Gaussian samples in each of the situations and check if all the resulting confidence intervals contain their respective true center.The coverage is then calculated as the average of the number of times the confidence intervals cover simultaneously the true ranks which is an estimator of the true simultaneous confidence level that the method provides.We also indicate the coverage of our methods; Tukey's HSD and the sequential-rejective algorithm.
The coverage of the method of Zhang et al. [2014] is almost equal to the true confidence level only when the centers are very far from each other.When the centers are close to each other, the method fails to provide a good coverage and breaks down completely when the centers are equal.This is in line with the theoretical and simulated results shown by Xie et al. [2009] and  1: The coverage of the three methods producing simultaneous CIs calculated based on 100 simulated 10-samples.The confidence level is set to 95%.Hall and Miller [2009] for the case of individual CIs for ranks using bootstrap.Because we are interested in ranking institutions where the differences among the hospitals are generally small, the bootstrap method should not be employed because there is no guarantee that it provides a correct simultaneous confidence level.Thus, it will not be included in further comparisons.Note also that the coverage of our methods is quite high.We therefore believe that there is still a possibility to improve these methods and shorten the resulting confidence intervals.
8.2 A comparison between the CIs produced by Tukey's HSD and its sequential variant We generate a dataset of 50 centers from the Gaussian distributions N (i, σ 2 i ) for i ∈ {1, • • • , 50}.The standard deviations were generated uniformly in the interval [0.5, 1.5].We plot in figure (2) the confidence intervals for the ranks at joint level 99%.The confidence intervals are represented by colored regions instead of bars in order to facilitate the comparison between the two methods.The rankability was 0.797 for the sequential-rejective algorithm and 0.791 for Tukey's HSD.The sequential variant of Tukey's HSD provides several improvements on either the lower or the upper bounds of the centers.The improvement on the rankability is not very clear because the improvements were only 1 rank shorter for the CIs.

Hospitals in the Netherlands -DSAA dataset
We study a dataset for Dutch hospitals concerning abdominal aneurysms surgery.The study includes 9489 patients operated at 64 hospitals in the Netherlands at dates mostly between the years 2012 and 2016.The number of patients per hospital ranged from 3 to 358 with an average of 150 patients per hospital.The dataset includes the following variables • the hospital ID where the patient was treated; • the date of surgery; • the context of surgery: Elective, Urgent, Emergency; • the surgical procedure: "Endovascular", "Endovascular converted" and "Open"."Endovascular" means the patient had a minimal invasive procedure through the femoral artery in the groin."Endovasculair converted" means the surgeons first tried a minimal invasive procedure through the femoral artery in the groin, but then realized they had to do an open surgery; • a complication within 30 days (yes or no); • the mortality within 30 days (yes or no); • VpPOSSUM: a numerical score that summarizes the pre-operative state of the patient.
In order to conform to the normality assumption in our model, we excluded hospitals with small number of patients.This left us with 61 hospitals and each one of them has at least 54 patients.
We compare these hospitals according to the mortality rate within 30 days.We correct for case-mix effect with a fixed effect logistic regression model using the VpPOSSUM variable.One of the hospitals has no patients who died within 30 days after surgery.Thus, we added to all the hospitals a row of data with a virtual patient who died within 30 days after surgery and with a value of VpPOSSUM equals to the average in the corresponding hospital.This prevents the logistic regression from getting an infinite standard error for this hospital.Besides, the influence on the other hospitals is rather minor because of the relatively high number of patients in them.The simultaneous CIs at joint level 95% are illustrated in figure (3).The confidence intervals cover the whole range of ranks, and there are barely any differences among the hospitals according to the mortality rate.The rankability is about 0.002 for both methods; Tukey's HSD and our variant.This can either be normal, that is all Dutch hospitals have the same performance, or due to a low power of our methods.In order to find out, we change the output variable in the logistic regression model and correct for case-mix effects with the type of surgery as an output.We make a forest plot for the hospital effect after case-mix correction for both the mortality withing 30 days and the surgical procedure.Figure ( 4) shows that indeed the mortality rate induces very few differences among the hospitals whereas the type of surgery seems to show more differences.The resulting CIs at joint level 95% for the ranks are illustrated in figure (5) with a rankability of 0.149 for the sequential-rejective algorithm and 0.147 for Tukey's HSD.The sequential-rejective algorithm improved Tukey's HSD CIs in 5 intervals (by one rank).The spotted differences in the forest plot among the hospitals were actually reflected in the CIs for the ranks and we are able to detect differences among the hospitals.We are also able to state that for only 17 hospitals we find that they may get the first rank, and that for the remaining 44 hospitals we can confidently state they are not first rank.

Discussion
We have presented two novel methods to produce simultaneous CIs for ranks with application to ranking quality of care at hospitals.Simultaneity is important when one is interested in with joint level 95%.The performance indicator is the mortality rate, and the hospital effect is corrected for case-mix using a logistic regression.The hospitals are not distinguishable using the mortality rate.Tukey and its sequential variant gave identical results.

Mortality
Figure 4: Forest plots for the hospitals effect after case-mix correction based on the mortality rate or the surgical procedure.The hospital effects based on the mortality rate are less distinguishable (have wider CIs) than the ones based on the surgical procedure.
identifying the best hospital, or the top 5, or the worst 10.Such overall results cannot be obtained from pointwise intervals.Our approach is based on Tukey's HSD and its improvement by the sequential rejection principle.The sequential-rejective variant provides a modest but uniform improvement over the method based on Tukey's HSD.The gain becomes greater when the differences among the centers become larger.The only method in the literature (as far as we know) claiming to produce simultaneous confidence intervals for ranks is [Zhang et al., 2014].However, theory indicates that the bootstrap does not provide proper coverage (Hall and Miller [2009], Xie et al. [2009]).We have performed a simulation that confirms that the bootstrap is severely anti-conservative.
We have compared the performance of 64 Dutch hospitals on mortality rates at 30 days after surgery.The simultaneous CI for ranks turned out to be all-inclusive which means that there is insufficient evidence for ranking on 30-day mortality.However, we have also compared the hospital on their preference for one of two types of surgery and then some differences among the hospitals became quite clear, especially the extremities.For example, were able for identify 17 hospitals for first rank.
It is easy to see that the event E i implies the event {l i ≥ L i , u i ≤ U i } for any i.Thus using inequality (A.1), we may write Hence, the confidence intervals for the set-ranks [L i , U i ] have a joint level of at least 1 − α.
A.2 Proof of Lemma 6.1 The first condition (5.1) is immediately fulfilled.Indeed, if R ⊆ S, then the quantile based on R uses more pairs than the quantile based on S since the maximum is calculated based on the set of unrejected pairs which is smaller for S than it is for R. In what concerns the second condition, it is fulfilled using similar arguments to those used in the proof that Tukey's HSD produces a simultaneous statement, see Hochberg and Tamhane [1987, para. 2.1.1.1].In condition (5.3), we need to make sure that after having rejected all false null hypotheses, the probability that we reject a true one never exceeds α.Suppose that all false null hypotheses F are rejected.The critical value is calculated using the remaining hypotheses (pairs).Let T = I × J = H \ F be the indexes of the remaining unrejected pairs.Denote q T the quantile of order 1 − α of the maximum calculated based on a sample of pairs of centered Gaussian random variables for which the standard deviations correspond of course to the same pairs in T , that is max 2) The probability that we falsely reject any new pair from T verifies The third line comes of course from the definition of q T as the quantile of a maximum of Gaussian random variables (A.2).Thus, condition (5.3) is fulfilled and by the sequential rejection principle, the FWER is controlled at level α.This ends the proof.
A.3 Proof of Proposition 6.1 According to lemma 5.1, the final (at convergence) set of rejected pairs by the algorithm, say R final step , is true with probability at least 1 − α.Consider center µ i .If the hypothesis µ i ≤ µ j

Figure 1 :
Figure1: Underestimating the Rankability R n .A simulated example showing 95% and 50% simultaneous CIs for the ranks of a set of centers forming three distinct blocks.The (normalized) surface of the light grey blocks is equal to 1 − R n .The normalized surface of the 50% (95% resp.)simultaneous CIs gives an underestimation of R n with a probability 50% (95% resp.).

Figure 2 :
Figure 2: Simulation Example.A comparison between Tukey's HSD and our sequential-rejective algorithm (Algorithm 1) on a simulated dataset showing the improvement that the latter method provides over the former.The number of centers is 50 and the observed values are generated from Gaussian distributions with standard deviation equal to 1.The produced CIs have simultaneous confidence level of 99%.

Figure 3 :
Figure3: Simultaneous confidence intervals for the ranks of 61 hospitals in the Netherlands with joint level 95%.The performance indicator is the mortality rate, and the hospital effect is corrected for case-mix using a logistic regression.The hospitals are not distinguishable using the mortality rate.Tukey and its sequential variant gave identical results.

Figure 5 :
Figure 5: Simultaneous confidence intervals for the ranks of 60 hospitals in the Netherlands.Data is corrected for case-mix effect.