E.7 Study Design and Statistical Issues

April 2008 March 2010, Current
POLICY: Every cancer genome project should state a clear rationale for its choice of sample size, in terms of the desired sensitivity to detect cancer-relevant changes. The target number of 500 is set as a minimum for common tumor types/subtypes; more than 500 samples may be required for tumors that demonstrate considerable heterogeneity. There are circumstances when 500 samples of a tumor type or subtype may be impractical (such as a rare cancer) or unnecessary (such as a tumor subtype that is known to be relatively homogeneous, based on pre-existing molecular studies). ICGC members proposing to undertake projects with fewer than 500 samples should provide the rationale for the choice of sample size.

The following considerations should be taken into account in planning:

  • As a rule of thumb, it is suggested that a tumor/normal collection should be large enough to reliably detect genes that are somatically mutated in 3% of tumors of a given subtype. This is based on the recognition that cancer types can be heterogeneous, with important genes already being found as mutated in 5-10% of samples;
  • Based on mathematical analysis, a collection of ~500 samples is needed to reliably detect genes that are somatically mutated in 3% of samples;
  • It may not be necessary to fully analyze all genes in 500 samples. Instead, one might use a two-tiered strategy in which (i) genes are studied in a discovery set (N samples) and (ii) a subset of genes that show sufficient frequency of mutations are studied in a validation set (M samples). With N = 100 and M = 400, one still has good power to detect genes that are mutated in 3% of samples;
  • While we suggest a detection level of 3% as a rule of thumb for a ‘typical’ cancer, the detection level should ideally reflect the actual heterogeneity of the cancer subtype. A gene could be mutated in a significant proportion of a subtype, but the overall mutation rate might fall below 3%. In practice, the degree of heterogeneity of a given tumor type is difficult to know in advance.

Nonetheless, some tumor types are known or thought to have more heterogeneous etiologies (for example, sarcomas), which may entail significantly more heterogeneous patterns of genomic and (epi)genetic alterations. In such cases, it could make sense to collect considerably more than 500 tumors.

In other cases, it may make sense to divide cancer types into distinct subtypes based on etiology or biology and, if feasible, assemble collections of each subtype. For example, investigators might be interested in identifying cancer genes associated with distinct subtypes. Examples might include studying smoking-related versus non-smoking-related lung cancers; or hepatocellular carcinomas arising in the setting of alcoholic cirrhosis versus viral hepatitis (B and C) versus helminthic infections versus aflatoxin.

Ultimately, the decision about sample collections must reflect the investigators’ best guesses about the underlying heterogeneity and the practical realities of sample collection. It is good to have larger collections at hand, even if only a subset is initially analyzed. The ultimate answer about the degree of heterogeneity will likely come from the genomic data themselves.

Box 5. Mathematical analysis

We briefly outline the mathematical analysis that supports the statements above.

Sample size. To identify cancer-related genes (drivers vs. passengers), one needs to detect genes that are mutated at a higher frequency than the background mutation rate. One has to calculate the probability of observing a given number of somatic mutations in the coding region of (i) a passenger gene in which somatic mutations occur at the background rate and (ii) a driver gene in which somatic mutations occur in 3% of samples.

Background mutation rates can vary between tumors and tumor types, but a typical rate is around 1.5 × 10^-6 non-synonymous mutations/base. If we make the simplifying assumption that all genes have a coding region of 1500 bases, this translates to a background rate of 2.25 × 10^-3 somatic mutations per gene per sample, or an expectation of ~1.125 somatic mutations per gene across a collection of 500 samples. Because there are 20,000 protein-coding genes, some genes will substantially exceed the expectation by random chance. Indeed, one expects that by chance there will be ~3.4 passenger genes with ≥7 non-synonymous mutations. One must take into account this issue of multiple hypothesis testing, for example by using a Bonferroni correction.
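The arithmetic in this paragraph can be checked with a short script. This is a minimal sketch, assuming that the number of background mutations per gene follows a Poisson distribution (the source does not name the model it used); the function name poisson_tail and the constant names are illustrative only.

```python
import math

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 minus the lower sum."""
    lower = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - lower

# Figures quoted in the text
RATE_PER_BASE = 1.5e-6   # non-synonymous mutations per base
CODING_BASES = 1500      # simplifying assumption: same coding length for every gene
N_SAMPLES = 500
N_GENES = 20_000

# Background mutations per gene per sample
per_gene_rate = RATE_PER_BASE * CODING_BASES

# Expected background mutations per gene across the whole collection
expected_per_gene = per_gene_rate * N_SAMPLES

# Expected number of passenger genes with >= 7 mutations by chance alone
# (Poisson model is an assumption of this sketch)
expected_passengers = N_GENES * poisson_tail(7, expected_per_gene)

print(f"per-gene rate per sample: {per_gene_rate:.3e}")
print(f"expected mutations per gene in {N_SAMPLES} samples: {expected_per_gene:.3f}")
print(f"expected passenger genes with >= 7 mutations: {expected_passengers:.2f}")
```

Under these assumptions the script gives a per-gene expectation of 1.125 mutations and roughly 3.4 passenger genes at or above 7 mutations, in line with the figures quoted above.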

By contrast, a driver gene in which somatic mutations occur in 3% of samples would be expected to have ~15 occurrences among a collection of 500 samples.

If one sets a threshold of at least 9 somatic mutations across 500 samples to declare significance, the probability that some passenger gene in the genome will reach this threshold is ~6%. By contrast, the probability that a driver gene (3% somatic mutation rate) will reach the threshold is 98%. If we allow for a missing data rate of ~24% due to incomplete coverage and sensitivity, the probability is 88%.
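The threshold calculation can be sketched as follows. This is an illustration under stated assumptions, not the analysis actually used for the figures above: passenger counts are modelled as Poisson, driver counts as binomial, and missing data is treated as a simple reduction of the per-sample detection probability. All function and parameter names are hypothetical. Because the exact model behind the quoted percentages is not given, the outputs should be expected to land near those figures rather than to match them exactly.

```python
import math

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    return 1.0 - sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    if k <= 0:
        return 1.0
    lower = sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
                for i in range(min(k, n + 1)))
    return 1.0 - lower

def passenger_false_positive(threshold, n_samples=500, n_genes=20_000,
                             per_gene_rate=2.25e-3):
    """Probability that at least one passenger gene reaches the threshold."""
    p_single = poisson_tail(threshold, per_gene_rate * n_samples)
    return 1.0 - (1.0 - p_single) ** n_genes

def driver_power(threshold, n_samples=500, driver_freq=0.03, missing_rate=0.0):
    """Probability that a driver gene reaches the threshold.

    missing_rate scales down the chance of observing each mutation
    (an assumption of this sketch, not a detail given in the text).
    """
    p_observed = driver_freq * (1.0 - missing_rate)
    return binom_tail(threshold, n_samples, p_observed)

fwer = passenger_false_positive(9)
power_full = driver_power(9)
power_missing = driver_power(9, missing_rate=0.24)

print(f"P(some passenger gene reaches 9): {fwer:.3f}")
print(f"driver power, complete data:     {power_full:.3f}")
print(f"driver power, with missing data: {power_missing:.3f}")
```

Varying the threshold argument shows the trade-off directly: lowering it raises the power for drivers but sharply increases the chance that some passenger gene is declared significant.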

In summary, a sample of 500 tumors provides 88% power to detect a gene mutated in 3% of samples, with a ~6% chance of a passenger gene achieving the threshold.

We note that this analysis is only approximate. For example: (i) the genes are assumed to have equal size; (ii) the model uses an average mutation rate per base and does not reflect differential mutation rates in different nucleotide contexts. More sophisticated statistical models should be used in analyzing actual data from cancer genome projects.

In addition, the sample size analysis focuses only on detection of cancer-related mutations. Different sample sizes may be required, for example, to make accurate risk estimates.

Two-stage design. Using the background mutation rate above, about 4,000 out of the 20,000 genes will have at least one mutation in the discovery set of 100 samples. Sequencing these 4,000 genes in the remaining 400 samples and requiring a total of at least 9 mutations (in combined discovery and validation sets) only slightly decreases the power to detect a gene which is mutated in 3% of samples, from 88% to 82%. However, this two-tiered strategy can reduce the sequencing costs to 36% of the single-tiered approach (100 × 20,000 + 400 × 4,000 = 3.6 million gene-samples, versus 500 × 20,000 = 10 million).
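The two-stage design can be explored with the sketch below. It assumes a Poisson model for the background in the discovery set, a binomial model for the driver gene, complete data, and a cost proportional to the number of gene-by-sample sequencing units. The rule that a gene must show at least one mutation in the discovery set to be carried forward follows the text; the function names are hypothetical, and the power values it prints are for complete data, so they are not directly comparable to the 88% and 82% quoted above.

```python
import math

N_GENES = 20_000
PER_GENE_RATE = 2.25e-3   # background mutations per gene per sample (from the text)
N_DISC, N_VAL = 100, 400  # discovery and validation set sizes
THRESHOLD = 9             # total mutations required across both sets

def binom_pmf(k, n, p):
    """P(X == k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    if k <= 0:
        return 1.0
    return 1.0 - sum(binom_pmf(i, n, p) for i in range(min(k, n + 1)))

def two_stage_power(p, n_disc, n_val, threshold):
    """P(>= 1 mutation in discovery AND >= threshold mutations in total)."""
    return sum(binom_pmf(d, n_disc, p) * binom_tail(threshold - d, n_val, p)
               for d in range(1, n_disc + 1))

# Genes expected to show at least one background mutation in the discovery set
genes_forward = N_GENES * (1.0 - math.exp(-PER_GENE_RATE * N_DISC))

# Sequencing cost relative to analysing every gene in every sample
two_stage_cost = N_DISC * N_GENES + N_VAL * genes_forward
single_stage_cost = (N_DISC + N_VAL) * N_GENES
cost_fraction = two_stage_cost / single_stage_cost

single_power = binom_tail(THRESHOLD, N_DISC + N_VAL, 0.03)
staged_power = two_stage_power(0.03, N_DISC, N_VAL, THRESHOLD)

print(f"genes carried forward: {genes_forward:.0f}")
print(f"relative sequencing cost: {cost_fraction:.2f}")
print(f"power, single stage: {single_power:.3f}; two stage: {staged_power:.3f}")
```

The power lost in the two-stage design comes entirely from driver genes that happen to show no mutation in the discovery set, which is why enlarging the discovery set recovers power at the expense of cost.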