Unmasking P-values

Where is the problem then? There are several common mistakes associated with the interpretation of P-values.

First, as the example just above illustrates, if a difference in response rate of 5% is not judged clinically relevant, the significant P-value in the trial with 800 patients per group is irrelevant from a clinical standpoint. This phenomenon often arises when testing binary or continuous endpoints, typically HRQOL measurements in trials powered on a survival endpoint. Statistical significance is not proof of clinical relevance; to judge the latter, one should instead consider the estimate of the effect size and its confidence interval (Schroeber P 2018).

The point estimate is the effect size most compatible with the data (the P-value would be 1 under the assumption that the true effect equals the observed effect). The confidence interval is the range of values of the true effect that would produce P>0.05, i.e. that would be compatible with the data (Greenland et al 2016). At EORTC, we always report confidence intervals and effect estimates. In addition to the confidence interval, we are now moving to report the Bayesian credible interval as well, which has a more straightforward interpretation: given the data, the true effect has a 95% probability of falling within the credible interval.
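
This duality between tests and confidence intervals can be checked directly. The sketch below (illustrative only, using a simulated sample and a one-sample t-test rather than any particular trial's data) verifies that testing against either endpoint of the 95% interval yields P ≈ 0.05, and testing against the point estimate itself yields P = 1:

```python
import numpy as np
from scipy import stats

# Hypothetical sample, for illustration only
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=30)

# Classical 95% t-interval for the mean
n = len(x)
mean, se = x.mean(), x.std(ddof=1) / np.sqrt(n)
lo, hi = stats.t.interval(0.95, n - 1, loc=mean, scale=se)

# Testing against either endpoint of the interval gives P = 0.05:
# the CI is exactly the set of null values not rejected at the 5% level.
p_lo = stats.ttest_1samp(x, popmean=lo).pvalue
p_hi = stats.ttest_1samp(x, popmean=hi).pvalue

# Testing against the point estimate gives P = 1: it is the value
# most compatible with the data.
p_mid = stats.ttest_1samp(x, popmean=mean).pvalue
print(p_lo, p_hi, p_mid)
```

The same logic applies to the two-proportion comparisons discussed here; the t-test is used only because the inversion is easiest to see in one line.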

The notion of statistical power best explains the influence of sample size on tests and P-values. The power of a test is the theoretical probability of achieving a P-value lower than the specified significance level (i.e. of rejecting the null hypothesis) for a given sample size and a given assumption about the magnitude of the effect. Power is an increasing function of the sample size and of the hypothesised effect size; it is not related to the observed effect. Notably, the statistical power under the null hypothesis of no effect is just… the statistical significance level! Figure 1 illustrates how the power increases with the size of the actual difference, when comparing two proportions in an experiment of 40 patients per arm. It shows that if the true difference were 20%, the power would be 64%; it would be 80% if the true difference were 25%. In contrast, for any effect size smaller than 18% the power is less than 50%, so a trial of that size has a more-than-even chance of ending up inconclusive. This is why EORTC enforces prospective sample size calculations and includes considerations of statistical power in the interpretation of study results.
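
A minimal sketch of such a power calculation, using the standard normal approximation for a two-sided comparison of two proportions (the exact values depend on the assumed control-arm rate, which Figure 1 fixes but is treated as an assumption here):

```python
import math
from scipy.stats import norm

def power_two_proportions(p1, p2, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions
    with n_per_arm patients in each arm (normal approximation)."""
    z_a = norm.ppf(1 - alpha / 2)
    d = p2 - p1
    p_bar = (p1 + p2) / 2
    # Standard error of the difference under H0 (pooled) and under H1
    se0 = math.sqrt(2 * p_bar * (1 - p_bar) / n_per_arm)
    se1 = math.sqrt(p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm)
    # Probability of rejecting in either tail
    upper = norm.cdf((d - z_a * se0) / se1)
    lower = norm.cdf((-d - z_a * se0) / se1)
    return upper + lower

# Under the null (no effect), the "power" collapses to the significance level
print(power_two_proportions(0.30, 0.30, 40))   # ~0.05

# Power grows with the hypothesised effect size and with the sample size
print(power_two_proportions(0.30, 0.50, 40))   # 40 patients per arm
print(power_two_proportions(0.30, 0.50, 80))   # 80 patients per arm
```

The control-arm rate of 30% is a hypothetical choice for illustration; with different baseline rates the same function reproduces the qualitative behaviour of Figure 1, namely that modest per-arm sample sizes only achieve adequate power for fairly large true differences.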
