I just wanted to make that clear in case any of you fellas were afraid your P value might be <0.05.
Actually I have a more serious reason for this post. A few people have commented on my “Good Fats, Bad Fats” article, expressing their wonder at why I gave any credence to the rise in total cancer on the vegetable oil diet in the LA Veterans Administration Hospital Study when it wasn’t statistically significant, with a P value of 0.06.
I should say upfront that I am no expert in statistics, so I offer what I have to say below as a starting point for a discussion on this issue rather than as a definitive declaration of truth.
The P value is the probability that we would obtain a difference at least this large if in fact there were no difference at all. In technical jargon, it’s the probability that we would obtain a difference at least this large if the null hypothesis were true.
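To make that definition concrete, here is a minimal sketch of how such a probability could be computed for a two-group trial. It uses a one-sided Fisher's exact test, which is my choice for illustration and not necessarily the test the study's authors used. The function name `one_sided_p_value` and the counts are hypothetical; they are not the figures from the LA Veterans study.

```python
from math import comb

def one_sided_p_value(events_a, n_a, events_b, n_b):
    """Probability, if the null hypothesis were true, that group A would
    end up with at least `events_a` of the combined events
    (one-sided Fisher's exact test)."""
    total_events = events_a + events_b
    total_people = n_a + n_b
    # Under the null, the events fall on people at random, so the number
    # landing in group A follows a hypergeometric distribution.
    ways_total = comb(total_people, total_events)
    most_possible = min(total_events, n_a)
    ways_as_extreme = sum(
        comb(n_a, k) * comb(n_b, total_events - k)
        for k in range(events_a, most_possible + 1)
    )
    return ways_as_extreme / ways_total

# Hypothetical counts for illustration only, NOT the study's data.
p = one_sided_p_value(events_a=15, n_a=100, events_b=8, n_b=100)
print(f"P = {p:.3f}")
```

The sum runs over every outcome at least as lopsided as the one observed, which is what "a difference at least this large" means in practice.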
In this case, we want to know if vegetable oils cause cancer. To put that in more practical terms, we want to know what would happen if the population at large started consuming vegetable oils and abandoned traditional animal fats. Would there be more cancer than there otherwise would have been? Less?
To test this, we randomly allocate people to consume either vegetable oils or animal fats. The control group eating animal fats (and unfortunately some hydrogenated vegetable oil in this study) represents the “otherwise would have been” in the paragraph above. If we placed everyone on vegetable oil, we’d have no idea whether the vegetable oil was the cause of whatever happened rather than, say, getting older, the housing bubble, anxiety over the Greek financial crisis, Venus crossing the sun in June for the first time in over a hundred years, or the well established Bieber Effect. By having each group consume its respective diet concurrently, each group is equally exposed to all these tumultuous processes and events. We randomly allocate people to ensure that the myriad known and unknown confounders are relatively evenly distributed between the two groups (which I’ve discussed in more detail here, and which unfortunately didn’t work very well for distributing smoking habits evenly in this study).
Thus, randomization between a treatment and a control group is the most powerful tool we have for demonstrating a cause-and-effect relationship because it isolates the “cause” as purely as possible (randomized) and allows us to measure the “effect” by comparing the two groups (controlled). Put these two principles together and we have a randomized, controlled trial.
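As a rough illustration of how random allocation spreads a confounder between groups, here is a small sketch. The population size, the number of smokers, and all the names are assumptions made up for the example, not data from the study.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 volunteers, 300 of whom smoke.
volunteers = [{"smoker": i < 300} for i in range(1000)]

# Randomize: shuffle the volunteers, then split them down the middle.
rng.shuffle(volunteers)
vegetable_oil_group = volunteers[:500]
animal_fat_group = volunteers[500:]

def smoking_rate(group):
    """Fraction of a group that smokes."""
    return sum(person["smoker"] for person in group) / len(group)

print(f"smokers on vegetable oil: {smoking_rate(vegetable_oil_group):.1%}")
print(f"smokers on animal fat:    {smoking_rate(animal_fat_group):.1%}")
```

The two rates should come out close but rarely identical. Randomization balances confounders only approximately, and the balance tends to get worse as the groups get smaller.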
But here’s the catch.
For our study, we do not take the whole population we care about (that would cost far too much money and be incomprehensibly impractical to carry out); instead, we select a sample of people from that population. The downside of taking such a sample is that we introduce some random error. We might imagine that there are thousands of things that could affect the incidence of cancer (many different genes, household contamination with mold or environmental toxins, other dietary factors, lifestyle factors, and so on), and there’s simply no way that the distribution of each of those things will be exactly the same between our study sample and the population at large.
Thus, if we repeated this study a hundred times, we wouldn’t expect to get the same exact result every time. If vegetable oils truly increase the risk of cancer, we would expect the vegetable oil group to have a higher incidence of cancer in most of those hypothetical hundred studies, but the size of the difference would vary from one study to another. This variation would result from “random” error: from (usually small) differences in the distribution of that great and swarming sea of known and unknown confounding variables between each sample of study participants from one study to the next.
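The thought experiment of repeating the study a hundred times can be sketched as a simulation. The underlying risks (10% and 15%), the group size of 400, and the function name `simulate_study` are all assumptions chosen for illustration. The sketch also simplifies the random error down to chance variation in who develops cancer, and does not model individual confounders.

```python
import random

def simulate_study(rng, n_per_group, risk_control, risk_treated):
    """Run one hypothetical trial and return the difference in cancer
    incidence (treated minus control)."""
    control_cases = sum(rng.random() < risk_control for _ in range(n_per_group))
    treated_cases = sum(rng.random() < risk_treated for _ in range(n_per_group))
    return treated_cases / n_per_group - control_cases / n_per_group

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Assumed "true" risks: 10% on animal fat, 15% on vegetable oil.
differences = [simulate_study(rng, 400, 0.10, 0.15) for _ in range(100)]

higher_on_oil = sum(d > 0 for d in differences)
print(f"studies with more cancer on vegetable oil: {higher_on_oil} of 100")
print(f"smallest difference: {min(differences):+.3f}")
print(f"largest difference:  {max(differences):+.3f}")
```

Even with a real effect built in, the size of the difference bounces around from one simulated study to the next.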
So our null hypothesis is that there would be no difference in the incidence of cancer between vegetable oils and animal fats at the population level. If this null hypothesis were true, the probability of obtaining a difference at least as large as the one shown in the graph above in a sample of that population would be six percent. We call this probability P, and thus we say that P = 0.06. To phrase this in a looser but more intuitively graspable way, we can be 94% confident that this difference was not due to chance, with the caveat that P is calculated on the assumption that the null hypothesis is true, so it is not literally the probability that chance produced the result.
Here is what I had therefore written in the caption of the figure:
The difference between the groups reaches the border of statistical significance at P=0.06, meaning we can be 94 percent confident that the difference is not due to chance.
In most cases, researchers would call this “borderline significant” or say there was a “trend” towards an increase in cancer because they would consider something “significant” only if P were less than 0.05. I paid due homage to this phrasing in the caption but such phrases really are unfortunate terms that play into the myth that the significance level is anything more than an arbitrary declaration by fiat that something is or isn’t “significant.”
I’ve sometimes seen it said that being “borderline significant” is like being “borderline pregnant.” Something either is or isn’t significant, according to this view, just as someone either is or isn’t pregnant. But this simile is wrong on two counts: first, the P value is a continuous variable, not a categorical one like pregnancy, so of course something can be close to being significant just like you could come close to being in Utah if you traveled north from Phoenix and stopped a mile from the border; second, since the level of significance is purely arbitrary, we could theoretically set it to whatever we want. If we said that something is significant when P<0.1 then suddenly P = 0.06 would be significant! Like magic!
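The "magic" takes only a few lines to show. The thresholds below are the conventional ones mentioned in this post, and the variable names are mine:

```python
p_value = 0.06  # the P value discussed in this post

# The same result judged against three different significance levels.
verdicts = {alpha: p_value < alpha for alpha in (0.1, 0.05, 0.01)}

for alpha, significant in verdicts.items():
    label = "significant" if significant else "not significant"
    print(f"threshold {alpha}: {label}")
```

Nothing about the data changes between the three lines of output; only the cutoff does.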
There is no reason other than prevailing convention that we couldn’t simply set the level of significance at 0.1 instead of 0.05, just like we could set it at a more rigorous level of 0.01. And the prevailing convention is upheld not at the insistence of statisticians (after all, most scientific papers do not have statisticians as co-authors, most scientific papers are not peer-reviewed by statisticians, and journals that publish experimental science often do not have devoted statistical editors); rather, the convention is upheld by force of habit, and is at least occasionally criticized by statisticians. For example, this is what the textbook Statistical Methods in Medical Research (2001; 2008) has to say (p. 88 in the 2008 edition):
The 5% level and, to a lesser extent, the 1% level have become widely accepted as convenient yardsticks for assessing the significance of departures from a null hypothesis. This is unfortunate in a way, because there should be no rigid distinction between a departure which is just beyond the 5% significance level and one which just fails to reach it. It is perhaps preferable to avoid the dichotomy, ‘significant’ or ‘not significant’, by attempting to measure how significant the departure is.
A convenient way of measuring this is to report the probability, P, of obtaining, if the null hypothesis were true, a sample as extreme as, or more extreme than the sample obtained. One reason for the origin of the use of the dichotomy, significant or not significant, is that significance levels had to be looked up in tables, such as Appendix Tables A2, A3 and A4 [in this book], and this restricted the evaluation of P to a range. Nowadays significance tests are usually carried out by a computer and most statistical computing packages give the calculated P value. It is preferable to quote this value and we shall follow this practice. However, when analyses are carried out by hand, or the calculated P value is not given in computer output, then a range of values could be quoted. This should be done as precisely as possible, particularly when the result is of borderline significance; thus, ‘0.05 < P < 0.1’ is far preferable to ‘not significant (P > 0.05)’.
In my view, the level of confidence we require for making a given decision should depend on the potential impact of that decision. For example, if we want to gain insight into how something works, it might make sense to take a scientific finding very seriously when P<0.1, because that might represent an important lead for future scientific work. That future work could attempt to expand on the earlier finding while also using a larger sample size in an attempt to obtain a lower P value.
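To see why a larger sample tends to yield a lower P value when the effect is real, here is a minimal sketch that holds the observed rates fixed and only increases the group size. It uses a two-proportion z-test with a normal approximation, which is my choice for illustration. The rates (12% versus 8%), the group sizes, and the function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p_value(events_a, events_b, n_per_group):
    """Two-sided P value for a difference between two proportions,
    using the pooled normal approximation (equal group sizes)."""
    rate_a = events_a / n_per_group
    rate_b = events_b / n_per_group
    pooled = (events_a + events_b) / (2 * n_per_group)
    standard_error = sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    z = (rate_a - rate_b) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same hypothetical rates (12% vs 8%) at increasing sample sizes.
p_values = []
for n in (200, 400, 800, 1600):
    p = two_sided_p_value(round(0.12 * n), round(0.08 * n), n)
    p_values.append(p)
    print(f"n = {n:>4} per group: P = {p:.4f}")
```

The identical four-point difference moves from unremarkable to very convincing as the groups grow, which is why a follow-up study with more participants is the natural next step after a suggestive result.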
If we want to show that a drug reduces the risk of disease, on the other hand, we might want something much stricter, like P<0.01 or P<0.001, because we know from experience that drugs are often more harmful than they first appear in clinical trials, and newer drugs are usually more expensive than old ones, so we don’t want to adopt an expensive and potentially dangerous drug on flimsy evidence for its efficacy. For the very same reason, we might take a more liberal approach when assessing the potential harm of something new. If a study shows an increase in cancer on vegetable oils at P = 0.06, we should be much more cautious about swigging the soybean oil.