Closed testing with Globaltest, with application in metabolomics

The Globaltest is a powerful test for the global null hypothesis that there is no association between a group of features and a response of interest, which is popular in pathway testing in metabolomics. Evaluating multiple feature sets, however, requires multiple testing correction. In this paper, we propose a multiple testing method, based on closed testing, specifically designed for the Globaltest. The proposed method controls the familywise error rate simultaneously over all possible feature sets, and therefore allows post hoc inference, that is, the researcher may choose feature sets of interest after seeing the data without jeopardizing error control. To circumvent the exponential computation time of closed testing, we derive a novel shortcut that allows exact closed testing to be performed on the scale of metabolomics data. An R package ctgt is available on comprehensive R archive network for the implementation of the shortcut procedure, with applications on several real metabolomics data examples.


INTRODUCTION
In high-dimensional data, features may often be meaningfully taken together in sets or groups. This is especially true in metabolomics, where metabolic pathways are sets of functionally associated metabolites. Analysis in the context of pathways provides mechanistic insights into the underlying biology (Xia et al., 2015).
Many methods have been proposed for feature set testing (Mathur et al., 2018); a popular one is Globaltest (Goeman et al., 2004), which is locally most powerful on average in a neighborhood of the null hypothesis and remains valid and powerful in high-dimensional data with more features than observations (Goeman et al., 2006). Furthermore, it adapts to the correlation structure of data. In Metabo-Analyst (Xia et al., 2015), a web-based analytical pipeline for metabolomics data, Globaltest is the default testing method for pathway analysis.
When many feature sets are tested, multiple testing correction is necessary. We follow Meijer and Goeman (2016), who argued that familywise error rate (FWER) is more appropriate for feature set testing than false discovery rate (FDR), which can be difficult to interpret when feature sets are nested. To control FWER for multiple Globaltests, several methods have been proposed based on a Bonferroni correction, such as the Focus Level (FL) procedure (Goeman and Mansmann, 2008), Directed Acyclic Graph (DAG) (Meijer and Goeman, 2015), and Structured Holm (SH; Meijer and Goeman, 2016). All these methods control FWER for a collection of feature sets that must be specified before the data were seen.
The most common feature set testing in metabolomics is pathway testing. With the development of highthroughput technologies, there have been a surge of metabolic pathway databases, such as "KEGG," "Biocyc," and "Wiki." These databases are different from each TA B L E 1 FWER for different methods (DAG, FL, and SH at level 5%) per database and simultaneously over all databases Note: The null data are simulated by permuting the 0/1 response 2000 times based on a real data set with 92 observations and 47 metabolites (Taware et al., 2018). A total of six annotation databases are used: "KEGG", "Biocyc", "SMPDB", "Biofunction", "Protein" from Metabolites Biological Role (MBROLE) (López-Ibáñez et al., 2016), and "Wiki" from "rWikiPathways" (Slenter et al., 2017). other in pathway content, structure, format, and functionality. Current practice, unfortunately, is to correct for multiple testing only within each database even when multiple databases have been used (López-Ibáñez et al., 2016). This causes an inflation of FWER when, as is common, multiple databases are explored or when feature sets of interest are selected in a data-driven way. This is illustrated in Table 1. The multiplicity issues for post hoc chosen pathway databases are often overlooked in practice. Moreover, post hoc definition of feature sets is also of interest when-as is common-feature sets overlap. When two overlapping feature sets are both significant, it is natural to follow up by looking at their intersection and set differences, though these derived feature sets are usually not in any database.
Post hoc inference, in the sense of choosing feature sets to be tested after seeing the data, is possible with closed testing (Marcus et al., 1976;Goeman and Solari, 2011). Since this method controls FWER for all possible feature sets, it allows researchers to postpone the selection of feature sets of interest after seeing the data. Goeman et al. (2021) also proved that only closed testing procedures are admissible for FWER control, that is, all other procedures either are equivalent to closed testing or can be improved using closed testing. Closed testing has been explored for pathway analysis in genomics by "SEA" (Ebrahimpoor et al., 2020), and building upon applications in neuroimaging (Rosenblatt et al., 2018). These methods use Simes test (Simes, 1986) as the local test, which is fast to implement with closed testing. However, Simes test is not an established test for feature set testing, and it is preferable to use Globaltest instead. Simes test requires assumptions on positive dependence of -values, and may be conservative when -values are strongly dependent.
In this paper, we develop a closed testing procedure using Globaltest. Our approach allows post hoc choice of feature sets, the only requirement is that the collection of all features from which features set are defined is specified a priori. The major challenge to perform closed testing is computational: it requires exponentially many tests. To speed up the closed testing procedure, we develop novel shortcuts, reducing the exponential number of Globaltests to linear. We first propose a "single-step" shortcut that is fast but approximate to the full closed testing procedure. It guarantees strong FWER control but may be conservative. To gain power, we then embed the single-step shortcut within a branch and bound algorithm, leading to an "iterative" shortcut. The iterative shortcut will approximate the full closed testing procedure closer and closer as we iterate longer, trading computation time for power, and converging eventually to the exact closed testing result. On the scale of typical metabolomics data (≈300 features), the exact closed testing result for a pathway can be obtained in seconds on a regular PC.
Although Globaltest is derived in the context of all generalized linear models we focus in this paper on logistic regression only, which is the most popular generalized linear model used with Globaltest.

THE GLOBALTEST
Suppose we have independent subjects on which features are measured. We gather the observations into a 0/1 response vector and a design matrix partitioned into an × matrix of features and an × matrix including the intercept term and potential confounders we would like to adjust for, for example, age and gender. We allow > , although we assume that < .
To denote a feature set, we will use the index set ⊂ {1, … , } of features it includes, and we will write = | | for its cardinality and for the submatrix of formed from columns indexed by . Globaltest (Goeman et al., 2004(Goeman et al., , 2006 is used to test feature set for association with the response. The Globaltest assumes the logistic model and tests the null hypothesis for an -dimensional vector , where ℎ( ) = exp( )∕(1 + exp( )) is the standard inverse logistic function and is a -dimensional vector of nuisance parameters. The Globaltest statistic for testing = under model (1) is given by where is the identity matrix of size and = ( ⊺ ) −1 ⊺ . It can be seen from (3) that the Globaltest statistic is the sum of the test statistics of the individual features in , that is, Theorem 1 in  shows that the null distribution of is asymptotically equivalent to a weighted sum of independent 2 1 variables, that is, where the weights 1 ≥ ⋯ ≥ are the eigenvalues of the positive semidefinite matrix 1∕2 ( − ) ⊺ ( − ) 1∕2 . Here, is the diagonal covariance matrix of under the null hypothesis, with entries ( | )(1 − ( | )). For a prespecified significance level , we can approximate the Globaltest critical value by the 1 − quantile of the asymptotic null distribution in (4), where we make the dependence of on the eigenvalues = ( 1 , … , ) explicit. To compute the critical value , we adopt Robbins and Pitman (1949) algorithm, as suggested in . Though it is slightly slower on average, it is numerically more stable and less vulnerable to the problem of premature convergence.
Proposition 1 shows that the Globaltest = { ≥ } is an asymptotically valid -level test. The proof of Proposition 1 and of all the following lemmas and theorems can be found in the Supporting Information.

Proposition 1.
Assume that the logistic model in (1) holds with = 0, then that is, the Globaltest has asymptotic type I error control.

CLOSED TESTING
When testing feature sets, we are interested in finding sets in which there is evidence of some association between the signal of the features in the set and the response. We suppose some features are associated with the response and some features are not associated with the response, that is, null features. As usual with Globaltest, we adopt the selfcontained paradigm for feature set testing (Goeman and Bühlmann, 2007), in which a "null-feature" set is defined as a set containing only null features. Let = {1, … , } be the set of all features, which should be fixed before seeing the data, and ⊆ the set of all null features. For any feature set ⊆ , the self-contained null hypothesis for is To allow post hoc inference, we will control FWER for the family  = 2 of all 2 possible feature sets, that is, 2 = { ∶ ⊆ }. The collection of null-feature sets is  = 2 . Our goal is to design a test procedure that rejects the collection of feature sets  ⊆  in such a way that FWER is controlled, that is, To obtain such FWER control, we will use the closed testing procedure (Marcus et al., 1976). Closed testing requires that the family of hypotheses is closed under intersection: for all , in the family we should have ∩ in the family. This is easy to check for the hypothesis family { ∶ ∈  }, since ∩ = ∪ . Closed testing is a coherent procedure (Goeman et al., 2021), that is, a null hypothesis is rejected if and only if all hypotheses that imply it are rejected by a valid level test. Formally, suppose for every ∈  , is a test of with 1 indicating rejection and 0 nonrejection. Closed testing rejects The closed testing procedure has FWER control (8) if is a valid -level test of , that is, when ( ) ≤ (Marcus et al., 1976). This generalizes to asymptotic FWER control if the test for is asymptotically valid, as we summarize in Proposition 2: Based on the discussion above, to be able to use closed testing with Globaltest (CTGT), we need to assume that the Globaltest is an asymptotically valid -level test of . We thus assume that the logistic model holds for model , that is, Note that Assumption 1 implies that a logistic regression model holds for the distribution of given . This assumption can be checked for the data at hand by using standard logistic regression diagnostics. Under this assumption, Globaltest for is an asymptotically valid level test, based on Proposition 1, and consequently FWER control (8) applies by Proposition 2. We note that we only need to assume the correct model specification for one single logistic regression model, that is, model . This is important, since it is not generally possible for several nested logistic models to be simultaneously valid (Gail et al., 1984). This robustness to model misspecification is a useful and often overlooked property of closed testing.

SINGLE-STEP SHORTCUT
A hypothesis can be rejected by closed testing if all hypotheses with ⊆ ⊆ are rejected by Globaltest at level . However, this results in exponential computational complexity of closed testing, problematic for large-scale data analysis. Shortcuts, efficient algorithms, are thus necessary to reduce computation burden (Brannath and Bretz, 2010;Gou et al., 2014;Dobriban, 2018). Shortcuts can be exact or approximate. Approximate shortcuts control FWER, but sacrifice power relative to the full closed testing procedure. In this paper, we first derive an approximate single-step shortcut and then an exact iterative shortcut for CTGT. We start with the single-step shortcut. We remark that the terminology of "single-step" should not be confused with the corresponding term in multiple testing procedures based on ordered -values.

Main idea
For any set of interest, closed testing rejects if and only if ≥ , for all ⊆ ⊆ . For illustration, we use a recurring toy example with = 100 observations, = 5 features and a binary response. Let = {1, 2, 3, 4, 5} be the index set of all features. Suppose that we want to test with = {3}. By closed testing, we have to calculate all test statistics and critical values for all 2 − hypotheses with ⊆ ⊆ . All these and are presented in Figure 1a by circles and triangles, respectively. For each pair of ( , ), if circles are all above the corresponding triangles, closed testing then rejects .
as the "level" of , the -axis in Figure 1a, the main idea of the single-step shortcut is as follows. We propose to construct a minimum test statistic line ( ) and a maximal critical value line ( ), such that for all If such ( ) and ( ) can be established, we then simply compare the two lines instead of the exponentially many pairwise comparisons. When is everywhere above , is certainly rejected by closed testing, as the following lemma says.

Lemma 1. If and satisfy (i) and (ii), respectively, then closed testing rejects at level when
In the following, we will show how to construct ( ) and ( ).

The minimum test statistic
We will construct the lower bound ( ) as the lower convex hull of all the points , for ⊆ ⊆ . We can construct the lower convex hull without evaluating all by using the additive structure of Globaltest statistics, given in (3). We have where = ⧵ and = ∕ . This simple weighted sum can be minimized for a given sum of weights by simply minimizing the 's. The support of the convex hull can therefore be found by finding the permutation { 1 , … , } of the elements of = ⧵ , with = | |, that sorts ( ) ∈ in ascending order. The supporting "bottommost" sets are given by Based on  , we formulate ( ), ∈ [ , ] as

The maximal critical value
For the maximal critical value, we need to find out a numeric vector for which the corresponding critical value is maximal among all the critical values at the same level.
As discussed in Section 2, the critical value is a function of the eigenvalue vector. To maximize the critical value, we will need to work through the eigenvalues. We first introduce the definition of majorization (Horn and Johnson, 2012): for all = 1, … , with equality for = .
Here, The majorizinĝ( ) simply takes the first few largest values of the upper bound as head and the last few smallest values of the lower bound as tail, and connecting them by an ( ) such that̂( ) is in descending order and its sum is . Obviously,̂( ) is still bounded by and , but it majorizes all other eigenvalue vectors at the same level.
We argue that the critical value computed by the majorizing vector is maximal among the ones at the same level, in terms of the following theorem in Bock et al. (1987). Theorem 1. Suppose that ≻ . Then there exists an 0 such that, for ≤ 0 , we have ( ) ≥ ( ).
A proof of Theorem 1 is in Bock et al. (1987), we only change notations. To understand the result intuitively, note that if ≻ , then ∑ =1 2 1 and ∑ =1 2 1 have the same mean, but the former has larger variance. Moreover, since ∑ =1 2 1 it puts more weight on a small number of 2 1 -variables, its tail is slightly more like that of a 2 1 , while the tail of ∑ =1 2 1 is slightly more like that of a 2 .
By combining Theorem 1 and the definition of the majorizing vector, we define the maximal critical value as ( ) = (̂( )), which has the property described in the following lemma.
In the toy example Figure 1a, given the upper bound and the lower bound with = {3}, the ( ) line and the exact critical values for all are presented as dashed line and triangle points. It is clear that ( ) is above all exact critical values. In addition to avoiding the exponentially many critical value computations, we further note that calculatinĝ( ) for all possible levels only requires calculation of eigenvalues and once. This significantly reduces the computing time especially for large matrices (ie, large ). Moreover, it is shown in Algorithm 1 in the Supporting Information, a fast algorithm for checking intersection of and , that is calculated only for a few levels, not for all levels between and .
In above lemma, we may see that the validity of depends on 0 , which has to be sufficiently large for Lemma 3 to be useful. Diaconis and Perlman (1990)  cross exactly once, implying that 0 would be far from 0 or 1. However, their conjecture was disproved by Yu (2017) who showed that the two cdfs cross an odd number of times (but sometimes more than once). However, the cdf of ∑ =1 2 1 will be always below that of ∑ =1 2 1 after the last crossing point, as Theorem 1 claims. The value of 0 in the paper is exactly tail probability corresponding to the last crossing point. Usually, practitioners would like to take significance level = 5%, which requires 5% ≤ 0 . We tested this in the real data examples with diverse sizes of hypotheses, where we find that 0 is empirically in the range of 25-30%, see the Supporting Information for more detail.

The single-step shortcut
With everything set in place, we check whether can be rejected by the single-step shortcut via checking if the minimum test statistic line is above the maximal critical value line. If ( ) > ( ), ∈ [ , ], is certainly rejected by the closed testing procedure based on Lemma 1. For example, {3} in Figure 1a is rejected by closed testing at level 5%, as the line is totally above the line indicating that all hypotheses corresponding to the supersets of {3} are rejected. Otherwise, we can turn to the conclusion that cannot be rejected so as to guarantee the FWER control. We thus summarize the "reject" and "not reject" rule of the single-step shortcut as: and do not reject otherwise. (15) A fast algorithm to efficiently check whether is totally above is presented in the Supporting Information.

Sure or unsure outcomes
While a "reject" decision by the shortcut always indicates a rejection by the full closed testing procedure, a "not reject," where ( ) < ( ) for some does not always indicate a "not reject" by the closed testing procedure: the single-step shortcut may be conservative. There is, however, an easy distinction to be made between nonrejection that certainly also correspond to nonrejections by the closed testing procedure, and nonrejections, that may or may not correspond to rejections by the closed testing procedure.
To establish this difference in case of a nonrejection by the single-step shortcut, we check the exact test statistics and exact critical values for all sets in  , the bottommost points defined in Section 4.2. If there exists a ∈  such that < , it is conclusive that closed testing does not reject . For example, {2} in Figure 1b, we find that Globaltest does not reject {24} and {245} so that {2} cannot be rejected by closed testing. On the other hand, if ≥ for all ∈  , we will be uncertain about the "not reject" of by closed testing, which is the case of {1} in Figure 1c, where we cannot determine that {1} is rejected by closed testing or not. In summary, we can expand the single-step shortcut to give three possible outcomes: "reject," "not reject," and "unsure." The unsure outcomes can be further explored by an iterative procedure that we describe in the following section.

The iterative shortcut
Clearly, the single-step shortcut is approximate in the sense that it gives at most the same rejections as the full closed testing procedure, but possibly fewer because we might get unsure outcomes. Next, we investigate how we can make it exact. If an unsure outcome is obtained from the single-step shortcut, we turn to the branch and bound algorithm of Land and Doig (1960), which is commonly used for solving NP-hard optimization problems. The branch and bound algorithm consists of two principles: a branching rule that partitions the search space into smaller subspaces and a bounding rule that is used for tracking the optimization in the subspaces and pruning those subspaces that it can prove will not contain an optional solution. Westfall and Tobias (2007) has introduced its application in closed testing with max-test, though this algorithm is otherwise unrelated to ours. We show in this paper how branch and bound can be used to reduce the conservativeness of the single-step shortcut at the expense of an increased computational burden.
Suppose that we get an unsure outcome for . This means that the line and the line intersect in the space of { ∶ ⊆ ⊆ } and ≥ for all ∈  . In terms of the branch and bound algorithm, we first split { ∶ ⊆ ⊆ } into two disjoint subspaces by distinguishing whether or not ∈ ⧵ is included: Here, is the index of the feature for which is the largest in ⧵ , as defined in Section 4.2. Second, we recalculate the line and the line separately for each subspace. If ≥ in both subspaces, we stop branching and conclude that is rejected by closed testing, as all that are split into two subspaces are all rejected by Globaltest. If there exists a subspace, say − , such that < for some ∈ − , we can stop branching and conclude that closed testing does not reject . Otherwise, there exists a subspace for which uncertainty remains. In this case, we can repeat the above steps until we get certain outcomes or we exceed the allotted computational capacity.
Illustration of the branch and bound algorithm can be seen in Figure 2, where we are unsure to reject 1 or not by closed testing. After splitting the full space into two: { ∶ ⊆ ⊆ ⧵ {3}} and { ∶ ∪ {3} ⊆ ⊆ }, and lines are recalculated in each subspace. We see in Figure 2 that > in both subspaces, thereby 1 is certainly rejected by closed testing.
Obviously, the stopping rule of the iterative shortcut can be determined by the number of iterations: how many times we iterate the single-step shortcut. The more iterations, the more power we gain. We allow user to prespecify the number of iterations to save computation time but without sacrificing FWER control. If we apply the

F I G U R E 2 Iterative shortcut rejects {1}
shortcut long enough so that no unsure outcomes left, the full closed testing solution will be obtained. Pseudocode for the iterative shortcut with 100 iterations at most is presented in Algorithm 2 in Supporting Information.
Let  be the rejection set of closed testing,  the rejection set of the iterative shortcut with iterations prespecified. Specifically, let  0 be the set of rejections by the single-step shortcut and  ∞ = lim →∞  the asymptotic rejection set of iterative shortcut. We summarize the convergence property of the iterative shortcut in Theorem 2.
There is clearly a trade-off between computing time and approaching the full closed testing when applying the branch and bound algorithm. We can trade time for power. At the worst case, the computational complexity of the iterative shortcut is still exponential. Nonetheless, its computing time of iterative shortcut is dramatically smaller in practice than that of the naive closed testing, which is computationally difficult to perform in largescale problems; we discuss the computation time of the shortcut and the full closed testing procedure in the Supporting Information.

REAL DATA APPLICATION
To investigate the power property of CTGT, we apply it to four real metabolomics data sets, whose role on regulatory pathways of human pathophysiology, ranging from aging to disease, has been highlighted. The detailed information of the data sets are listed in Table 2, named as "Eisner," "Bordbar," "Taware," and "Al-Mutawa," see more information in the Supporting Information. To use CTGT, we need to check Assumption 1. Since in this case we have no covariates and contains only the intercept term, the assumption reduces to that responses are marginally independent and identical Bernoulli.
To be able to compare CTGT with DAG, SH, and FL, we chose pathway databases of interest a priori: we applied the methods on the union of all pathways from Biocyc, KEGG, SMPDB, and WikiPathways. The first three annotation vocabularies are obtained in "MBROLE" and the last is generated by "rWikiPathways." We include individual metabolites (after removing missing values and filtering out lowly expressed metabolites) as single pathways. Information of pathways, including the total number of pathways after removing repeated ones, the mean size and the maximal size, is presented in Table 2. Since DAG, SGH, and FL were applied to the union of the pathway databases, these methods allow post hoc choice of pathways, but within the prespecified databases only. In addition, we consider closed testing with Simes test (CTST) as a competitor, which is post hoc as well.
To compare the power properties of our method with other methods, we present the total number of rejections per method on the diagonal of subtables per data set in Table 3, together with the number of shared rejections of any two methods under the diagonal. CTGT represents the exact closed testing with Globaltest, that is, the iterative shortcut without unsure outcomes, which can be achieved by setting a large enough number of iterations, we set 20,000 in this analysis. This does not mean that we have to iterate 20,000 times, since the iterative shortcut stops immediately when there is no unsure outcomes. For example, CTGT needs 2645 iterations at worst for Eisner.
It is shown in Table 3 that CTGT method discovers more pathways than its competitors for data sets Eisner and Bordbar. Especially for Bordbar with only 12 samples but 51 metabolites, the small sample size could weaken the effects of metabolites to some extent, thereby leading to good power of Globaltest. Furthermore, the small sample size influences CTGT method less than the Bonferronibased methods and CTST due to their reliance on very small tail probabilities. We note that not all results are in favor of CTGT, for example, in Taware, the low dimensionality and relatively few small-size pathways make DAG, SH, and FL powerful. Remark that we chose the pathways of interest a priori, but only CTST and CTGT retain type I error control if pathways are chosen post hoc. Further insights on comparisons between CTST and CTGT are made based on artificial data in the next section.
Results in Table 3 may also shed some light on the underlying metabolic events. For example, in data set Eisner, where researchers analyzed urine samples from patients with cancer to identify metabolites that are associated with muscle wasting, CTGT method finds totally 144 pathways are significantly associated with muscle loss. One hundred and thirty of the pathways are shared findings with other method and 14 are uniquely discovered by CTGT, for example, a KEGG pathway map00340 in class of amino acid metabolism is uniquely discovered by CTGT and not by the others. This is consistent with what they found in Eisner et al. (2011) that metabolites associated with amino acid metabolism were prominent.
To gain more insights into the property of CTGT, we present in the Supporting Information that the rejected pathways by CTGT are mainly large ones and it is more powerful with more iterations.

DISCUSSION
We have proposed a novel multiple testing procedure based on CTGT, with main applications on metabolomics annotation databases. Our method controls FWER for all possible feature subsets so that it allows the choice of feature sets of interest to be made after seeing the data. It is therefore a selective inference method (Benjamini, 2010). Still, the new method has comparable power to competing methods even when a limited number of feature sets is specified beforehand. To reduce the computational burden of closed testing, we have derived an iterative shortcut procedure. The iterative shortcut can be stopped at any point while retaining FWER control, gains power as more computation time is spent, and eventually converges to the full closed testing procedure. We have implemented both shortcuts in R package ctgt. A potential limitation of the method is that it is only valid for small significance level , that is, ≤ 0 . However, we have found that 0 is heuristically around 30%, so most values of that are used in practice are typically safe to use.
In our data analysis examples, we found that closed testing based on Simes tests (Goeman et al., 2019) was competitive, and perhaps even more robust, in terms of power to the multiple testing procedure based on Globaltest that we have developed in this paper. This is a surprising and interesting finding, as Simes tests have not seen much use in pathway testing in metabolomics. A small simulation study has been performed to compare CTGT and CTST in the Supporting Information. Further research is needed to assess the relative merits of both methods.

O P E N R E S E A R C H B A D G E S
Data and materials are available at https://dataverse. harvard.edu/dataverse/ctgt-biom.

D ATA AVA I L A B I L I T Y S TAT E M E N T
The data that support the findings in this paper are openly available in "MateboAnalyst" at https://www. metaboanalyst.ca/home.xhtml and "MetaboLights" at https://www.ebi.ac.uk/metabolights/, reference number MTBLS23, MTBLS760, and MTBLS541.