Today we’re going to get a bit meta. Just as important as science, if not more so, is the science of science itself. How do we know if the papers being published every day of the week are of a high enough standard to be trusted and to take human knowledge forward? What proportion of them will turn out to be wrong in time? How much of this is just due to a natural progression of knowledge and how much of it is due to shoddy work that should have been rooted out pre-publication?
I think science has a problem: not a fatal one, but one that it needs to address. It is increasingly the case that journals are only interested in publishing the papers that will get the most headlines and/or the most citations, thereby increasing that journal’s Impact Factor. Researchers will naturally want to publish in the journals with the highest Impact Factor and may, on occasion, massage things to help ensure they do so. There is increasingly little space for papers with a negative result or replications of previous experiments, both of which are absolutely vital to science but are not sexy or headline-grabbing.
The easiest way to get published is to have a statistically significant P-value. Broadly speaking, the P-value is the probability of getting a result at least as extreme as the one you observed if random chance alone were at work, that is, if the effect you are testing for did not exist. The smaller the P-value, the harder the result is to explain by chance alone, although a small P-value does not by itself tell you how likely your hypothesis is to be correct. By convention, a P-value below 0.05 is called statistically significant. But the problem is that there are lots of different ways to generate a P-value. Different data sets suit different types of statistical analysis, and within each analysis there will be certain parameters and limits to set. How and where these limits are set can give very different results, perhaps pushing a negative data set just over the margin into significance.
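To make the idea of a P-value concrete, here is a minimal sketch in Python using a made-up coin-flipping experiment (the experiment, the function name two_sided_p and the numbers are my own illustration, not taken from any study). The "chance alone" assumption is that the coin is fair, and the P-value is the probability of a head count at least as lopsided as the one observed.

```python
from math import comb

def two_sided_p(heads, flips):
    """Exact P-value for 'the coin is fair' (a hypothetical example).

    Adds up the probability of every head count that is at least as far
    from the expected flips/2 as the observed count is.
    """
    distance = abs(2 * heads - flips)  # twice the distance from flips/2
    ways = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(2 * k - flips) >= distance
    )
    return ways / 2 ** flips  # each sequence of flips is equally likely

# One extra head moves the result across the conventional 0.05 line.
for heads in (60, 61):
    print(heads, "heads out of 100:", round(two_sided_p(heads, 100), 4))
```

With these made-up numbers, 60 heads out of 100 lands just above 0.05 and 61 heads lands below it, which is the sort of margin that a small analytical choice can tip.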
I should say that this can all be done completely innocently. Researchers needn’t be malevolently thinking of ways to con the world into thinking that they have a real effect when they don’t. All of the little decisions that go into designing a research study, of any kind, can be referred to as Researcher Degrees of Freedom (RDFs). Multiple studies have now shown that the more RDFs you have, the more likely the analysis is to find significance, even when there is no real effect to find. Decisions about when to stop collecting data, which observations to exclude, which comparisons to make and which data sets to combine all have an impact on the final results. The phenomenon has come to be known as P-hacking.
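One of those decisions, when to stop collecting data, is easy to simulate. The sketch below is my own illustration, under stated assumptions: the data are pure noise (normally distributed with a true mean of 0 and a known standard deviation of 1), the test is a simple two-sided z-test, and the names p_value_z and false_positive_rate are hypothetical. It compares testing once at 100 observations with peeking after every 10 and stopping as soon as the result looks significant.

```python
import math
import random

def p_value_z(sample):
    """Two-sided P-value for 'the true mean is 0', assuming a known SD of 1."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(peek_every, max_n=100, trials=2000, alpha=0.05, seed=1):
    """Fraction of no-effect experiments that still reach significance.

    peek_every should divide max_n evenly; the analysis is rerun after each
    batch, and data collection stops at the first significant result.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = []
        while len(sample) < max_n:
            sample.extend(rng.gauss(0.0, 1.0) for _ in range(peek_every))
            if p_value_z(sample) < alpha:
                hits += 1
                break
    return hits / trials

fixed = false_positive_rate(peek_every=100)   # one look, at n = 100
peeking = false_positive_rate(peek_every=10)  # up to ten looks
print(f"test once at n=100: {fixed:.3f}")
print(f"peek every 10:      {peeking:.3f}")
```

Nobody in this simulation is cheating, and there is no effect to find, yet the peeking strategy declares significance noticeably more often than the nominal 5%.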
In an open access article published last year in PLOS ONE, researchers from the US Department of Health and Human Services detailed an interesting but slightly worrying observation. They looked at every large cardiovascular disease study conducted at the National Heart, Lung and Blood Institute between 1970 and 2012, defining large as costing more than $500,000 per year to run. There were 55 such trials. These were then split into those published before (30) and after (25) the date in 2000 when it became compulsory to register your clinical trial, and specifically what it was going to do, before publication.
Of the 30 studies published before 2000, 17 (57%) had a positive result; of the 25 published after 2000, only 2 (8%) did. No other factor they looked at, such as corporate co-sponsorship of the work or whether the trial compared against a placebo or an active comparator, made any difference to the figures. This one simple measure, of forcing scientists to register exactly what the parameters of their study would be before publication, seems to have led to a roughly sevenfold decrease in the proportion of positive trials.
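The arithmetic behind that comparison is easy to check. The sketch below recomputes the two proportions and their ratio from the counts quoted above, then runs a Fisher's exact test on the same 2x2 table as a rough check of how surprising the difference would be by chance. The Fisher test is my own addition, not necessarily the analysis the paper's authors used, and the function name fisher_two_sided is hypothetical.

```python
from math import comb

# Counts quoted above: positive results out of all large trials.
pre_pos, pre_total = 17, 30    # published before 2000
post_pos, post_total = 2, 25   # published after 2000

pre_rate = pre_pos / pre_total
post_rate = post_pos / post_total
ratio = pre_rate / post_rate
print(f"before: {pre_rate:.0%}, after: {post_rate:.0%}, ratio: {ratio:.1f}")

def fisher_two_sided(a, row1, c, row2):
    """Two-sided Fisher's exact test for a 2x2 table (illustrative only).

    a of row1 and c of row2 are the positive counts. Keeping the total number
    of positives fixed, it sums the probability of every split that is no
    more likely than the observed one.
    """
    positives = a + c
    def weight(k):  # ways to place k of the positives in the first row
        return comb(row1, k) * comb(row2, positives - k)
    lowest = max(0, positives - row2)
    highest = min(row1, positives)
    observed = weight(a)
    extreme = sum(
        weight(k) for k in range(lowest, highest + 1) if weight(k) <= observed
    )
    return extreme / comb(row1 + row2, positives)

p = fisher_two_sided(pre_pos, pre_total, post_pos, post_total)
print(f"Fisher's exact test, two-sided P-value: {p:.5f}")
```

A small P-value here only says that a gap this size is hard to put down to chance in which trials happened to land in which group; it does not say that registration caused the drop.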
This is, of course, just one study, and one study is never proof of anything. This needs to be replicated in multiple data sets by different groups to see if the effect is real. If it is, it could have profound implications for randomised controlled studies the world over. To be clear, if there is a bad scientific article published, it will get found out in time. The scientific method and the peer review process are not fundamentally broken, but a lot of people might waste a lot of time and research money on a dead end, and in these straitened times we cannot afford, as a community, as a people, to be led a merry dance by effects that weren’t even there in the first place.