Genome-wide association study identifies 32 novel breast cancer susceptibility loci from overall and subtype-specific analyses

Breast cancer susceptibility variants frequently show heterogeneity in associations by tumor subtype1–3. To identify novel loci, we performed a genome-wide association study including 133,384 breast cancer cases and 113,789 controls, plus 18,908 BRCA1 mutation carriers (9,414 with breast cancer) of European ancestry, using both standard and novel methodologies that account for underlying tumor heterogeneity by estrogen receptor, progesterone receptor and human epidermal growth factor receptor 2 status and tumor grade. We identified 32 novel susceptibility loci (P < 5.0 × 10−8), 15 of which showed evidence for associations with at least one tumor feature (false discovery rate < 0.05). Five loci showed associations (P < 0.05) in opposite directions between luminal and non-luminal subtypes. In silico analyses showed that these five loci contained cell-specific enhancers that differed between normal luminal and basal mammary cells. The genetic correlations between five intrinsic-like subtypes ranged from 0.35 to 0.80. The proportion of genome-wide chip heritability explained by all known susceptibility loci was 54.2% for luminal A-like disease and 37.6% for triple-negative disease. The odds ratios of polygenic risk scores, which included 330 variants, for the highest 1% of quantiles compared with middle quantiles were 5.63 and 3.02 for luminal A-like and triple-negative disease, respectively. These findings provide an improved understanding of genetic predisposition to breast cancer subtypes and will inform the development of subtype-specific polygenic risk scores. Genome-wide analysis identifies 32 loci associated with breast cancer susceptibility, accounting for estrogen receptor, progesterone receptor and human epidermal growth factor receptor 2 status and tumor grade.

with genotyping data from one of two Illumina genome-wide custom arrays. In analyses of overall breast cancer, we also included summary-level data from 11 other breast cancer GWASs (14,910 cases and 17,588 controls) without subtype information. Our study expands on previous BCAC GWASs 1 , with additional data on 10,407 cases and 7,815 controls-an approximate increase of 10 and 9%, respectively (Supplementary Tables 1-4).
The statistical methods are further described in the Methods and Extended Data Fig. 1. To identify variants for overall breast cancer (invasive, in situ or unknown invasiveness) in BCAC, we used standard logistic regression to estimate odds ratios (ORs) and 95% confidence intervals (95% CIs), adjusting for country and principal components. iCOGS and OncoArray data were evaluated separately and the results were combined with those from the 11 other GWASs using fixed-effects meta-analysis.
To identify breast cancer susceptibility variants displaying evidence of heterogeneity, we used a novel score test based on a two-stage polytomous model 4 that allows flexible, yet parsimonious, modeling of associations in the presence of underlying heterogeneity by estrogen receptor, progesterone receptor, HER2 and/or grade (Methods and Supplementary Note). The model handles missing tumor characteristic data by implementing an efficient expectation-maximization algorithm 4,7 . These analyses were restricted to BCAC controls and invasive cases (Methods). We fit an additional two-stage model to estimate case-control ORs and 95% CIs between the variants and intrinsic-like subtypes defined by combinations of estrogen receptor, progesterone receptor, HER2 and grade 8 (5) triple-negative or basal-like. We analyzed iCOGS and OncoArray data separately, adjusting for principal components and age, and meta-analyzed the results using a fixed-effects model. We evaluated the effect of country using a leave-one-out sensitivity analysis (Methods).
Among BRCA1 mutation carriers who are prone to developing triple-negative disease 9 , we estimated per-allele hazard ratios within a retrospective cohort analysis framework. We assumed that estimated ORs for BCAC triple-negative cases and estimated hazard ratios from CIMBA BRCA1 carriers approximated the same underlying relative risk 9 , and we used a fixed-effects meta-analysis to combine these results (Methods). Among all novel variants, we used the two-stage polytomous model to test for heterogeneity in associations across subtypes, globally and by tumor-specific markers (Methods).

Genome-wide association study identifies 32 novel breast cancer susceptibility loci from overall and subtype-specific analyses
A full list of authors and affiliations appears at the end of the paper.
Letters Nature GeNetics polytomous model (eight of which were not detected by standard logistic regression) and three variants in the CIMBA/BCAC triple-negative meta-analysis (rs78378222 was also detected by the two-stage polytomous model in BCAC). Fourteen additional variants (P < 5.0 × 10 −8 ) were excluded: 13 because they lacked evidence of association independent of known susceptibility variants in conditional analyses (P ≥ 1.0 × 10 −6 ; Supplementary Tables 8-10) and one (chr22:40042814) for showing a high degree of sensitivity in the leave-one-out country analysis following exclusion of studies from the United States ( Supplementary Fig. 6). Supplementary Figs. 7 and 8 and Supplementary Table 11 show associations between all 32 variants and the intrinsic-like subtypes.
Candidate causal variants (CCVs) were defined (Methods) for each novel locus and we investigated the CCVs in relation to previously annotated enhancers in primary breast cells 12 . Based on combinations of H3K4me1 and H3K27ac histone modification chromatin immunoprecipitation sequencing (ChIP-seq) signals, putative enhancers in basal cells, luminal progenitor cells and mature luminal cells were characterized as off, primed or active (Methods). We defined switch enhancers as those exhibiting different characterizations between cell types. Among the five loci identified with associations in opposite directions between subtypes, at least one CCV per locus overlapped a switch enhancer (Fig. 4). For example, rs78378222 overlapped an active enhancer in basal cells, a primed enhancer in luminal progenitor cells and an off

Letters
Nature GeNetics enhancer in mature luminal cells. In comparison, 63% of the loci with consistent direction of associations across subtypes also overlapped with a switch enhancer (Supplementary Tables 13 and 14). These results suggest that some variants may modulate enhancer activity in a cell type-specific manner, thus differentially influencing the risk of tumor subtypes. We used INQUISIT to intersect CCVs with functional annotation data from public databases to identify potential target genes 1 (Supplementary Note and Supplementary Table 15). We predicted 179 unique target genes for 26 of the 32 independent signals.
Notably, rs78378222 has been reported to be associated with TP53 messenger RNA levels in blood and adipose tissue 11 , which we did not replicate in breast tissue. However, our findings of rs78378222 overlapping a cell type-specific regulatory element in breast basal epithelial cells implicates enhancer function as another potential TP53 transcriptional control mechanism. Twenty-three target genes in 14 regions were predicted with high confidence (designated level 1), of which 22 target genes in 13 regions were predicted to be distally regulated. Four target genes were previously predicted by INQUISIT 13  Genetic variance corresponds to heritability on the frailty scale, which assumes the polygenetic log-additive model as the underlying model. b Susceptibility variants included 178 previoulsy identified variants 1,2 and 32 variants newly identified in this paper. c The genetic variance of all reliably genome-wide imputable variants was estimated through linkage disequilibrium score regression, as described in refs. 18,19 . Under the frailty scale, the genetic variance for all GWAS variants was characterized by population variance of the underlying true PRS as σ 2 GWAS ¼ Var

Nature GeNetics
(a known somatic breast cancer driver gene 15 ), along with genes implicated by transcriptome-wide association studies (LINC00886 (ref. 16 ) and YBEY 17 ). We used linkage disequilibrium score regression to investigate genetic correlations 18,19 between subtypes and compare enrichment of genomic features 20 between luminal A-like and triple-negative subtypes (Methods). All subtypes were moderately to highly correlated, with luminal A-like and triple-negative having a correlation of 0.46 (s.e. = 0.05). The correlation between breast cancer in BRCA1 carriers and triple-negative breast cancers in BCAC participants was 0.83 (s.e. = 0.08), suggesting a high degree of similarity in the genetic basis between these subtypes (Fig. 5 and Supplementary Table 16). To compare genomic enrichment, we first evaluated 53 annotations and found that triple-negative tumors were most enriched for 'super-enhancers, extend500bp' (3.04-fold; P = 3.3 × 10 −6 ) and 'digital genomic footprint, extend500bp' (from DNase hypersensitive sites) (2.2-fold; P = 4.0 × 10 −4 ); however, no annotations significantly differed between luminal A-like and triple-negative tumors (Supplementary Table 17 and Supplementary Fig. 9). On investigation of cell-specific enrichment of the histone markers H3K4me1, H3K3me3, H3K9ac and H3K27ac (Supplementary Note), we found both luminal A and triple-negative subtypes enriched for gastrointestinal cell types and suppression of central nervous system cell types ( Supplementary Fig. 10).
These analyses show the benefit of combining standard GWAS methods with methods accounting for underlying tumor heterogeneity. Moreover, these methods and results may help to clarify mechanisms predisposing to specific molecular subtypes, and provide precise risk estimates for subtypes to inform the development of subtype-specific PRSs 22 . However, to expand the generalizability of our findings, these analyses should be replicated and expanded in multi-ancestry populations.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41588-020-0609-2.  Table 1). Most of the studies were case-control studies in the general population or hospital setting, or nested within population-based cohorts, but a subset of studies oversampled cases with a family history of the disease. We included controls, cases of invasive breast cancer, cases of carcinoma in situ and cases of unknown invasiveness. Information on clinicopathological characteristics was collected for the individual studies and combined in a central database after quality control checks. We used the BCAC database version 10 for these analyses. Among a subset of participants (n = 16,766) who were genotyped on both the iCOGS and OncoArray arrays, we kept only the OncoArray data.
One study contributing to the iCOGS dataset (Leuven Multidisciplinary Breast Centre) was excluded due to inflation of the test statistics that was not corrected by adjustment for the first ten principal components. We also excluded OncoArray data from Norway (the Norwegian Breast Cancer Study) because there were no controls available from Norway with OncoArray data. All participating studies were approved by their appropriate ethics or institutional review board and all participants provided informed consent. The total sample size for this analysis, including iCOGS, OncoArray and other GWAS data, comprised 133,384 cases and 113,789 controls.
In the GWAS analyses accounting for underlying heterogeneity according to estrogen receptor, progesterone receptor, HER2 and grade, we included genotyping data from 81 BCAC studies. These analyses were restricted to controls and cases of invasive breast cancer. We excluded cases of carcinoma in situ and cases with missing information on invasiveness, as ~96% of in situ cases were missing some or all of the tumor markers and in situ cases potentially have different tumor characteristic correlations compared with invasive cases, which could potentially bias the estimates from the expectation-maximization based missing data handling algorithm (Supplementary Table 2). We also excluded all studies from a specific country if there were no controls for that country, or if the tumor marker data were missing on two or more of the tumor marker subtypes (see the footnote of Supplementary Table 2 for a further explanation of the excluded studies). We did not include the summary results from the 14,910 cases and 17,588 controls from the 11 other GWASs in subtype analyses because these studies did not provide data on tumor characteristics. We also excluded invasive cases (n = 293) and controls (n = 4,285) with missing data on age at diagnosis or age at enrollment (included in the expectation-maximization algorithm to better impute missing tumor characteristics). In total, the final sample for the two-stage polytomous logistic regression comprised 106,278 invasive cases and 91,477 controls.
Participants included from CIMBA were women of European ancestry aged 18 years or older with a pathogenic BRCA1 variant. Most participants were sampled through cancer genetics clinics. In some instances, multiple members of the same family were enrolled. OncoArray genotype data were available from 58 studies from 24 countries. Following quality control and the removal of participants who overlapped with the BCAC OncoArray study, data were available on 15,566 BRCA1 mutation carriers, of whom 7,784 were affected with breast cancer (Supplementary Table 3). We also obtained iCOGS genotype data on 3,342 BRCA1 mutation carriers (1,630 with breast cancer) from 54 studies through CIMBA. All BRCA1 mutation carriers provided written informed consent and participated under ethically approved protocols.
Genotyping, quality control and imputation. Details on genotype calling, quality control and imputation for the OncoArray, iCOGS and GWASs are described elsewhere 1,2,5,6 . Genotyped or imputed variants (including bi-allelic and multi-allelic single-nucleotide polymorphisms (SNPs) and small indels) marking each of the loci were determined using the iCOGS and OncoArray genotyping arrays and imputation to the 1000 Genomes Project (phase 3) reference panel. We included variants from each component GWAS with an imputation quality score of >0.3. We restricted analysis to variants with a minor allele frequency of >0.005 in the overall breast cancer analysis and >0.01 in the subtype analysis.
Known breast cancer susceptibility variants. Previous studies identified susceptibility variants from genome-wide analyses at a significance level of P < 5.0 × 10 −8 for all breast cancer types, estrogen-receptor-negative or estrogen-receptor-positive breast cancer, in BRCA1 or BRCA2 mutation carriers, or in meta-analyses of these 1-3 . We defined known breast cancer susceptibility variants as those variants that were identified or replicated in previous BCAC analyses 1,2 .
To help ensure that novel, independent susceptibility variants were identified, we excluded from these analyses variants within 500 kilobases (kb) of a previously published variant. These excluded regions were subject to separate, fine-mapping conditional analyses that were focused on identifying additional independent susceptibility variants in these regions 14 .
Standard analysis of BCAC data. Logistic regression analyses were conducted separately for the iCOGS and OncoArray datasets, adjusting for country and the array-specific first ten principal components for ancestry informative variants. The methods for estimating principal components have been described elsewhere 1,2 .
For the remaining GWASs, adjustment for inflation was done by adjusting for up to three principal components and using genomic control adjustment, as previously described 1 . We evaluated the associations between approximately 10.8 million variants with imputation quality scores (r 2 ) ≥ 0.3 and minor allele frequency (MAF) scores of >0.005. We excluded variants located within ±500 kb of, or in linkage disequilibrium (r 2 ≥ 0.1) with, known susceptibility variants 21 . The association effect size estimates from these GWASs, as well as the previously derived estimates from the 11 other GWASs, were then combined using a fixed-effects meta-analysis. Since individual-level genotyping data were not available for some previous GWASs, we conservatively approximated the potential overlap between the GWAS and iCOGS and OncoArray datasets, based on the populations contributing to each GWAS (iCOGS/GWAS: 626 controls and 923 cases; OncoArray/GWAS: 20 controls and 990 cases). We then used these adjusted data to estimate the correlation in the effect size estimates, and incorporated these into the meta-analysis using the method of Lin and Sullivan 23 .
Subtype analysis of BCAC data. We have described the two-stage polytomous logistic regression in more detail elsewhere 4,24 (Supplementary Note). In brief, this method allows for efficient testing of a variant-disease association in the presence of tumor subtype heterogeneity defined by multiple tumor characteristics, while accounting for multiple testing and missing data on tumor characteristics. In the first stage, the model uses a polytomous logistic regression to model case-control ORs between the variants and all possible subtypes that could be of interest, defined by the combination of the tumor markers. For example, in a model fit to evaluate heterogeneity according to estrogen receptor, progesterone receptor and HER2-positive/negative status and grade of differentiation (low, intermediate or high grade), the first stage incorporates case-control ORs for 24 subtypes defined by the cross-classification of these factors. The second stage restructures the first-stage subtype-specific case-control OR parameters through a decomposition procedure resulting in a baseline parameter that represents a case-control OR of a baseline cancer subtype, and case-case OR parameters for each individual tumor characteristic. The second-stage case-case parameters can be used to perform heterogeneity tests with respect to each specific tumor marker while adjusting for the other tumor markers in the model. The two-stage model efficiently handles missing data by implementing an expectation-maximization algorithm 4,7 that essentially performs iterative imputation of the missing tumor characteristics conditional on available tumor characteristics and baseline covariates based on the underlying two-stage polytomous model. In the two-stage model, the frequency of different tumor subtypes corresponding to different combinations of the tumor characteristics is allowed to vary freely through the model-free specification of the intercepts of the first-stage polytomous model (α m ; see Supplementary Note for details). In other words, the intercepts are kept saturated. As these parameters are estimated from the data themselves, the methodology accounts for the correlation among the tumor markers in a robust manner that does not require strong modeling assumptions.
To identify novel susceptibility loci, we used both a fixed-effects two-stage polytomous model and a mixed-effects two-stage polytomous model. The score test we developed based on the mixed-effects model allows coefficients associated with individual tumor characteristics to enter as either fixed-or random-effects terms. Our previous analyses have shown that incorporation of random-effects terms can improve the power of the score test by essentially reducing the effective degrees of freedom associated with fixed effects related to exploratory markers (that is, markers for which there is little previous evidence to suggest that they are a source of heterogeneity) 4 . In contrast, incorporation of fixed-effects terms can preserve distinct associations of known important tumor characteristics, such as estrogen receptor. In the mixed-effects two-stage polytomous model, we therefore kept estrogen receptor as a fixed effect, but modeled progesterone receptor, HER2 and grade as random effects. We evaluated variants with MAF > 0.01 (~10.0 million) and r 2 ≥ 0.3, and excluded variants within ±500 kb of, or in linkage disequilibrium (r 2 ≥ 0.1) with, known susceptibility variants. A MAF > 0.01 was chosen to ensure an adequate sample size to generate stable estimates. We reported variants that passed the P value threshold of P < 5.0 × 10 −8 in either the fixed-or mixed-effects models.
Both fixed-and mixed-effects models adjusted for the top ten principal components and age. As age is correlated with tumor characteristics 25 , we added age as a covariate to improve the statistical power of the expectation-maximization algorithm. Country was not adjusted for in the subtype analyses, since doing so required adequate sample sizes for each subtype in each country to allow for convergence of the two-stage polytomous model. Instead, we assessed the influence of country on signals identified by the two-stage models by performing a leave-one-out sensitivity analyses in which we reevaluated novel signals after excluding data from each individual country. Data from the OncoArray and iCOGS arrays were analyzed separately and then meta-analyzed using fixed-effects meta-analysis.
Statistical analysis of the CIMBA data. We tested for associations between variants and breast cancer risk for BRCA1 mutation carriers using a score test statistic based on the retrospective likelihood of observing the variant genotypes conditional on breast cancer phenotypes (breast cancer status and censoring time) 26 . Analyses were performed separately for iCOGS and OncoArray data.

Nature GeNetics
To allow for non-independence among related individuals, a kinship-adjusted test was used that accounted for familial correlations 27 . We stratified analyses by country of residence and, for countries where the strata were sufficiently large (United States and Canada), by Ashkenazi Jewish ancestry. The results from the iCOGS and OncoArray data were then pooled using fixed-effects meta-analysis.

Meta-analysis of BCAC and CIMBA.
As the great majority of BRCA1-related breast cancers are triple-negative 28 , we performed a meta-analysis with the BCAC triple-negative results to increase the power to detect associations for the triple-negative subtype. We performed a fixed-effects meta-analysis of the results from BCAC triple-negative cases and CIMBA BRCA1 mutation carriers, using an inverse-variance fixed-effects approach implemented in METAL 29 . The estimates of association used were the logarithm of the per-allele hazard ratio estimate for association with breast cancer risk for BRCA1 mutation carriers from CIMBA and the logarithm of the per-allele odds ratio estimate for association with risk of triple-negative breast cancer based on the BCAC data.

Conditional analyses.
We performed two sets of conditional analyses. First, we investigated for evidence of multiple independent signals in identified loci by performing forward selection logistic regression, in which we adjusted the lead variant and analyzed the association for all remaining variants within ±500 kb of the lead variants, irrespective of linkage disequilibrium. Second, we confirmed the independence of 20 variants that were located within ±2 Mb of a known susceptibility region by conditioning the identified signals on the nearby known signal. Since these 20 variants were already genome-wide significant in the original GWAS scan and the conditional analyses restricted to local regions, we used a significance threshold of P < 1 × 10 −6 to control for type I error 30 .

Heterogeneity analysis of new association signals.
We evaluated all novel signals for evidence of heterogeneity using the two-stage polytomous model. We first performed a global test for heterogeneity under the mixed-effects model test to identify variants showing evidence of heterogeneity with respect to any of the underlying tumor markers, estrogen receptor, progesterone receptor, HER2 and/or grade. We accounted for multiple testing of the global heterogeneity test using an FDR < 0.05 under the Benjamini-Hochberg procedure 31 . Among the variants with observed heterogeneity, we then further used a fixed-effects two-stage model to evaluate the influence of specific tumor characteristic(s) driving observed heterogeneity, adjusted for the other markers in the model. We also fit separate fixed-effects two-stage models to estimate case-control ORs and 95% CIs for five surrogate intrinsic-like subtypes defined by combinations of estrogen receptor, progesterone receptor, HER2 and grade 8 : (1) luminal A-like (estrogen receptor + and/or progesterone receptor + ; HER2 − ; grades 1 and 2); (2) luminal B/HER2-negative-like (estrogen receptor + and/or progesterone receptor + ; HER2 − ; grade 3); (3) luminal B-like (estrogen receptor + and/or progesterone receptor + ; HER2 + ); (4) HER2-enriched-like (estrogen receptor − and progesterone receptor − ; HER2 + ); and (5) triple-negative (estrogen receptor − ; progesterone receptor − ; HER2 − ). Furthermore, we conducted sensitivity analysis by fitting a standard polytomous model among cases with complete data on the five intrinsic-like subtypes for the 32 novel variants and compared these results with the results from the two-stage polytomous model accounting for missing tumor data.

CCVs.
We defined credible sets of CCVs as variants located within ±500 kb of the lead variants in each novel region and with P values within 100-fold of magnitude of the lead variants. This is approximately equivalent to selecting variants whose posterior probability of causality is within two orders of magnitude of the most significant variant 32,33 . This approach was applied for detecting a set of potentially causal variants for all 32 identified variants. For the novel variants located within ±2 Mb of the known signals, we used the conditional P values to adjust for the known signals' associations.

Enhancer states analysis in breast subpopulations.
We obtained enhancer maps for three enriched primary breast subpopulations (basal, luminal progenitor and mature luminal) from Pellacani et al. 12 . Enhancer annotations were defined as active, primed or off based on a combination of H3K27ac and H3K4me1 histone modification ChIP-Seq signals using fragments per kilobase of transcript per million mapped reads thresholds, as previously described 12 . Briefly, genomic regions containing a high H3K4me1 signal observed in any cell type were used to define the superset of breast regulatory elements. A subpopulation cell type-specific H3K27ac signal (which is characteristic of active elements) within these elements was used as a measure of overall regulatory activity, where active sites were characterized by H3K4me1-high/H3K27ac-high, primed sites were characterized by H3K4me1-high/H3K27ac-low, and off sites were characterized by H3K4me1-low/H3K27ac-low. This enabled annotation of each enhancer element as either off, primed or active in all cell types. We then defined enhancers that exhibited differing states between at least one cell type as switch enhancers.
Genetic correlation analyses. We used linkage disequilibrium score regression [18][19][20] to estimate the genetic correlation between five intrinsic-like breast cancer subtypes. The analysis used the summary statistics based on the meta-analysis of the OncoArray, iCOGS and CIMBA meta-analysis. The genetic correlation 18 analysis was restricted to the roughly 1 million variants included in HapMap 3 with a MAF of >1% and an imputation quality score r 2 of >0.3 in the OncoArray data. Since two-stage polytomous models integrated an imputation algorithm for missing tumor characteristic data, we modified the linkage disequilibrium score regression to generate the effective sample size for each variant (Supplementary Note).

Genetic variance explained by identified susceptibility variants and all genome-wide imputable variants.
Genetic variance corresponds to heritability on the frailty scale, which assumes a polygenetic log-additive model as the underlying model. Under the log-additive model, the frailty scale heritability explained by the identified variants can be estimated by: where n is the total number of identified variants, pi is the MAF for the ith variant, β i is the log[odds ratio] estimate for the ith variant and τ i is the standard error of β i . To obtain the frailty scale heritability for invasive breast cancer explained by all of the GWAS variants, we used linkage disequilibrium score regression to estimate heritability (σ 2 GWAS I ) using the full set of summary statistics from either standard logistic regression for overall invasive breast cancer, the two-stage polytomous regression for the intrinsic-like subtypes or the CIMBA BRCA1 analysis for BRCA1 carriers. σ 2 GWAS I is characterized by population variance of the underlying true polygenetic risk scores as σ 2 GWAS ¼ Var where G m is the standardized genotype for the mth variant, β m is the true log[odds ratio] for the mth variant and M is the total number of causal variants among the GWAS variants. Thus, the proportion of heritability explained by identified variants relative to all imputable variants is: To estimate the proportion of the familial risk of invasive breast cancer that is explained by susceptibility variants, we defined the familial relative risk, λ, as the familial relative risk assuming a polygenic log-additive model that explains all of the familial aggregation of the disease 34 . Under the frailty scale, we define the broad sense heritability 35 as σ 2 . The relationship between λ and σ 2 was shown to be 34 . We assumed λ = 2 as the overall familial relative risk of invasive breast cancer 34 ; thus, σ 2 = 2 × log [2] and the proportion of the familial relative risk explained by identified susceptibility variants is , and the proportion of the familial relative risk explained by GWAS variants is σ 2 GWAS = 2´log 2 ½  ½  I . Analyses of heritability and the proportion of explained familial risk were restricted to 106,278 invasive cases and 91,477 controls (Supplementary Table 2). In addition, we compared estimates of GWAS chip hereditability across five intrinsic subtypes using linkage disequilibrium score regression where the summary statistics were derived using either a standard polytomous model applied to complete cases or the novel two-stage method that incorporates cases with missing tumor characteristics.
PRSs for five intrinsic-like subtypes. We constructed PRSs for the intrinsic-like subtypes, incorporating the newly identified variants and 313 variants previously reported in the development of PRSs for overall and estrogen-receptor-specific breast cancer 22 . The 313 SNPs included SNPs that did not reach genome-wide significance. After excluding variants within 500 kb of the 313 SNPs or with a linkage disequilibrium r 2 ≥0.1, 17 of the 32 novel variants were independent of the 313 SNPs. The BCAC data were split into a training dataset and a test dataset with proportions of 80 and 20%, respectively. Half of the test dataset were five studies nested within prospective cohorts, including Karolinska Mammography Project for Risk Prediction of Breast Cancer-Cohort Study, Mayo Mammography Health Study, The Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial, The Sister Study and UK Breakthrough Generations Study (Supplementary Table 2), and the other half were randomly selected among the subjects in OncoArray, excluding studies of bilateral breast cancer, studies or sub-studies with oversampling for family history, cases with ambiguous diagnosis and cases with missing tumor characteristics. We obtained the overall and estrogen-receptor-specific log[odds ratios] for 313 SNPs by respectively fitting standard and estrogen-receptor-specific logistic regression on the training dataset. We obtained the log[odds ratio] for 330 SNPs by fitting the fixed-effects two-stage polytomous model for five intrinsic-like subtypes on the training dataset (Supplementary Table 19).
Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
Summary-level statistics are available from http://bcac.ccge.medschl.cam.ac.uk/ bcacdata/ and http://cimba.ccge.medschl.cam.ac.uk/projects/. Requests for data can be made to the corresponding author or the Data Access Coordination Committees (DACCs) of BCAC (see above URL) and CIMBA (see above URL). BCAC DACC approval is required to access data from the 2SISTER, ABCS,  Table 6 for the 16 independent susceptibility variants identified using two-stage polytomous regression, accounting for tumor markers heterogeneity according to ER, PR, HER2, and grade. Note that 8 of the 16 variants were also detected in the overall breast cancer analysis (9) See Supplementary Table 7 for the 3 independent susceptibility variants identified in the CIMBA/BCAC-triple-negative TN meta-analysis. Note that rs78378222 was detected in both the analyses using the two-stage polytomous regression and in CIMBA/BCAC-triple-negative TN.

Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.

n/a Confirmed
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly The statistical test(s) used AND whether they are one-or two-sided Only common tests should be described solely by name; describe more complex techniques in the Methods section.
A description of all covariates tested A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals) For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted

Software and code
Policy information about availability of computer code

Data collection
We used SAS and R to access and manage the data.

Data analysis
We used these softwares to finish all the analysis: 1) R version 3.6.0 2) LDSC version 1.0.1 3) METAL 2011/03/25 version 4) MatrixEQTL v2.2. The analysis code could be found in GitHub repository (https://github.com/andrewhaoyu/breast_cancer_data_analysis). The analysis completed in the submitted paper could be reproduced through the code in the GitHub repository screened on Sep 13, 2019 with git commit id as 872fc6e.
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub All studies must disclose on these points even when the disclosure is negative.

Sample size
We analyzed all the studies available for breast cancer genome-wide association studies (GWAS) and pathology information from the largest consortium.
Data exclusions We excluded OncoArray data from Norway (the Norwegian Breast Cancer Study) because there are no controls available from Norway with OncoArray data. We also excluded one Study (Leuven Multidisciplinary Breast Centre) contributing to the iCOGs dataset due to inflation of the test statistics that was not corrected by adjustment for the first ten PCs.

Replication
We analyzed the all studies available instead of dividing the dataset into discovery and replication sets because of statistical power. We checked consistency of the results across studies/countries for replicability.
Randomization This was an observational genetic association study, hence randomization was not relevant. The analyses were adjusted for country and ancestry informative principal components, as described in the methods

Blinding
The laboratories conducting the genotyping did not have access to the phenotypic data (i.e. were blinded). Moreover, genotype calling was automated. The phenotype and genotype data were only combined during the statistical analysis Reporting for specific materials, systems and methods We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material, system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response. The analysis of data from Breast Cancer Association Consortium (BCAC) included women aged 16 years or older of European ancestry from 82 BCAC studies from over 20 countries, with genotyping data derived from two Illumina genome-wide custom arrays, the iCOGs and OncoArray. Most of the studies were case-control studies in the general population, or hospital setting, or nested within population-based cohorts, but a subset of studies oversampled cases with a family history of the disease We included controls and cases of invasive breast cancer, carcinoma in-situ, and cases of unknown invasiveness. Information on clinicopathologic characteristics were collected by the individual studies and combined in a central database after quality control checks. We used BCAC database version 'freeze' 10 for these analysis. All participating studies were approved by their appropriate ethnics or institutional review board and all participants provided informed consent. The total sample size from BCAC including iCOGs, OncoArray, and other GWAS data, comprised 133,384 cases and 113,789 controls. Participants included from Consortium of Investigators of Modifiers of BRCA1/2 (CIMBA) were women of European ancestry, aged 18 years or older with a pathogenic BRCA1 variant. Most participants were sampled through cancer genetics clinics. In nature research | reporting summary October 2018 some instances, multiple member of the same family were enrolled. OncoArray genotype data was avaiable from 58 studies from 24 countries. Following quality control and removal of participants that overlapped with BCAC OncoArray study, data were available on 15,566 BRCA1 mutation carriers, of whom 7,784 were affected with breast cancer. We also obtained iCOGs genotype data on 3,342 BRCA1 mutation carriers (1,630 with breast cancer) from 54 studies through CIMBA. All BRCA1 MUTATION carriers provided written informed consent and participated under ethically approved protocols.

Recruitment
Participants were recruited from epidemiology studies and selection factors are very unlikely to be related to genotypes.

Ethics oversight
All the studies were approved by local IRBs.
Note that full information on the approval of the study protocol must also be provided in the manuscript.