By Laurence Collette, PhD, Scientific Development Leader, EORTC HQ, Brussels, Belgium
P-values are ubiquitous in the medical literature. Unfortunately, there is a general tendency to put too much emphasis on them. Too often, the scientific community overemphasizes or disregards the results of medical experiments on the basis of a single P-value. Whether that P-value is “significant” then determines whether a trial is deemed “a success”.
P-values are effectively an extremely simplified statistical summary of the results of otherwise very complex scientific experiments, be they clinical trials or biomarker association studies. Their simplicity makes them attractive but is also the cause of their abuse.
P-values and the notion of statistical significance are very often misinterpreted and misunderstood. For this reason, and because of their actual shortcomings (which we will address below), recent years have seen a number of scientists rage against their use in provocative editorials such as “Scientists rise up against statistical significance” (Amrhein V et al. Nature 2019; Szucs and Ioannidis 2017), or propose to proscribe statistical significance (Ioannidis 2018, Ioannidis 2019). Statisticians themselves have dedicated an entire journal volume to the debate, “Moving to a world beyond P<0.05” (Wasserstein 2019).
So, what are P-values exactly and why are they so often abused?
The P-value is a concept invented by Sir Ronald Fisher in 1925 (Fisher 1925) to test a null hypothesis (H0) concerning the magnitude of a parameter of interest (e.g. the mean of a set of measurements, a difference in response rates, the correlation between two biomarker measurements, or the hazard ratio in a comparative survival study). The P-value is the probability of observing a result at least as extreme as the one observed in the experiment, under the assumption that the null hypothesis (typically, that the parameter is zero) is actually true.
The P-value measures how compatible the observed results are with the stated null hypothesis. If the data follow the assumed theoretical distribution and there is no bias in trial conduct or data collection, then under the null hypothesis the data should exhibit only random fluctuations around the hypothesised (null) value of the parameter of interest.
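To make the definition concrete, here is a minimal Python sketch of one way a P-value can be computed: an exact two-sided permutation test for a difference in means between two small groups. This is an illustration under stated assumptions, not the method of any particular trial. The function name `permutation_p_value` and the measurements are hypothetical, and the permutation approach stands in for whatever distributional test an actual analysis would use. Under H0 (true difference of zero) the group labels are interchangeable, so the P-value is simply the fraction of all possible relabellings that give a difference at least as extreme as the one observed.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation P-value for a difference in means.

    Null hypothesis (H0): the true difference in means is zero, so the
    group labels carry no information and can be reshuffled freely.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))

    extreme = 0
    total = 0
    # Enumerate every way of labelling n_a of the pooled values as "group A".
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diff = abs(mean(a) - mean(b))
        # Count relabellings at least as extreme as the observed result
        # (the small tolerance guards against floating-point rounding).
        if diff >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical biomarker measurements, invented for illustration only.
treated = [5.1, 6.3, 5.8, 6.9, 6.1]
control = [4.8, 5.0, 5.5, 4.9, 5.2]
p = permutation_p_value(treated, control)
print(f"P-value: {p:.3f}")
```

A small P-value here only says that such a large difference would rarely arise from random relabelling alone. It says nothing about the size of the effect, and it relies on the absence of bias in how the data were collected.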