I just wanted to make that clear in case any of you fellas were afraid your P value might be <0.05.
Actually I have a more serious reason for this post. A few people have commented on my “Good Fats, Bad Fats” article, expressing their wonder at why I gave any credence to the rise in total cancer on the vegetable oil diet in the LA Veterans Administration Hospital Study when it wasn’t statistically significant, with a P value of 0.06.
I should say upfront that I am no expert in statistics, so I’d rather offer what I have to say below as a starting point for a discussion on this issue rather than a definitive declaration of truth.
The P value is the probability that we would obtain a difference at least this large if in fact there were no difference at all. In technical jargon, it’s the probability that we would obtain a difference at least this large if the null hypothesis were true.
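To make that definition concrete, here is a quick sketch in Python. The cancer counts are invented for illustration, not the study's actual numbers: the idea is simply to treat the group labels as interchangeable (which is what the null hypothesis says), reshuffle them many times, and ask how often a difference at least as large as the observed one turns up by chance alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outcomes: 1 = developed cancer, 0 = did not (invented counts).
veg_oil    = np.array([1] * 30 + [0] * 170)   # 30 cancers among 200 subjects
animal_fat = np.array([1] * 18 + [0] * 182)   # 18 cancers among 200 subjects

observed_diff = veg_oil.mean() - animal_fat.mean()

pooled = np.concatenate([veg_oil, animal_fat])
n_veg = len(veg_oil)

# If the null hypothesis is true, the labels are interchangeable, so shuffle
# them and see how often a difference at least this large (in either
# direction) appears purely by chance.
n_sims, hits = 20_000, 0
for _ in range(n_sims):
    rng.shuffle(pooled)
    diff = pooled[:n_veg].mean() - pooled[n_veg:].mean()
    if abs(diff) >= abs(observed_diff):
        hits += 1

print(f"two-sided permutation P ≈ {hits / n_sims:.3f}")
```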
In this case, we want to know if vegetable oils cause cancer. To put that in more practical terms, we want to know what would happen if the population at large started consuming vegetable oils and abandoned traditional animal fats. Would there be more cancer than there otherwise would have been? Less?
To test this, we randomly allocate people to consume either vegetable oils or animal fats. The control group eating animal fats (and unfortunately some hydrogenated vegetable oil in this study) represents the “otherwise would have been” in the paragraph above. If we placed everyone on vegetable oil, we’d have no idea whether the vegetable oil was the cause of whatever happened rather than, say, getting older, the housing bubble, anxiety over the Greek financial crisis, Venus crossing the sun in June for the first time in over a hundred years, or the well established Bieber Effect. By having each group consume its respective diet concurrently, each group is equally exposed to all these tumultuous processes and events. We randomly allocate people to ensure that the myriad known and unknown confounders are relatively evenly distributed between the two groups (which I’ve discussed in more detail here, and which unfortunately didn’t work very well for distributing smoking habits evenly in this study).
Thus, randomization between a treatment and a control group is the most powerful tool we have for demonstrating a cause-and-effect relationship because it isolates the “cause” as purely as possible (randomized) and allows us to measure the “effect” by comparing the two groups (controlled). Put these two principles together and we have a randomized, controlled trial.
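For what it’s worth, the allocation step itself is easy to sketch. This is a toy illustration, not the trial’s actual procedure, and the subject names are placeholders:

```python
import random

def randomize(participants, seed=2012):
    """Randomly allocate participants to a treatment and a control group."""
    rng = random.Random(seed)
    shuffled = list(participants)   # copy so the original sequence is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]   # (vegetable-oil group, animal-fat group)

veg_oil_group, animal_fat_group = randomize(f"subject_{i:02d}" for i in range(20))
```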
But here’s the catch.
For our study, we take not the whole population we care about — simply because that would cost far too much money and be incomprehensibly impractical to carry out — but we instead select a sample of people from that population. The downside of taking such a sample is that we introduce some random error. We might imagine that there are thousands of things that could affect the incidence of cancer — many different genes, household contamination with mold or environmental toxins, other dietary factors, lifestyle factors, and so on — and there’s simply no way that the distribution of each of those things will be exactly the same between our study sample and the population at large.
Thus, if we repeated this study a hundred times, we wouldn’t expect to get the same exact result every time. If vegetable oils truly increase the risk of cancer, we would expect the vegetable oil group to have a higher incidence of cancer in most of those hypothetical hundred studies, but the size of the difference would vary from one study to another. This variation would result from “random” error: from (usually small) differences in the distribution of that great and swarming sea of known and unknown confounding variables between each sample of study participants from one study to the next.
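Here is a hedged little simulation of that idea, with all rates invented: it assumes vegetable oil truly raises the cancer rate from 10 to 15 percent and repeats the “study” a hundred times, showing how much the observed difference bounces around from one sample to the next.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented "true" population cancer rates: the effect is real in this simulation.
true_rate_animal_fat, true_rate_veg_oil = 0.10, 0.15
n_per_group, n_studies = 200, 100

diffs = []
for _ in range(n_studies):                        # one hundred hypothetical studies
    veg  = rng.binomial(1, true_rate_veg_oil, n_per_group)
    ctrl = rng.binomial(1, true_rate_animal_fat, n_per_group)
    diffs.append(veg.mean() - ctrl.mean())

diffs = np.array(diffs)
print(f"observed differences ranged from {diffs.min():+.3f} to {diffs.max():+.3f}")
print(f"the vegetable-oil group fared worse in {np.mean(diffs > 0):.0%} of the studies")
```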
So our null hypothesis is that there would be no difference in the incidence of cancer between vegetable oils and animal fats at the population level. If this null hypothesis were true, the probability of obtaining a difference at least as large as the one shown in the graph above in a sample of that population would be six percent. We call this probability P, and thus we say that P = 0.06. To phrase this in a more intuitively graspable way, we can be 94% confident that this difference was not due to chance.
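For anyone curious how a number in that neighborhood arises mechanically, here is a minimal sketch using a standard chi-square test, with the same invented counts as in the earlier sketch rather than the trial’s actual figures, so the printed P will only be in the same ballpark:

```python
from scipy.stats import chi2_contingency

# Rows: vegetable-oil group, animal-fat group; columns: cancer, no cancer.
# These counts are invented placeholders, not the trial's actual figures.
table = [[30, 170],
         [18, 182]]

# correction=False skips the Yates continuity correction for this illustration.
chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
print(f"chi-square = {chi2:.2f}, two-sided P = {p_value:.3f}")
```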
Here is what I had therefore written in the caption of the figure:
The difference between the groups reaches the border of statistical significance at P=0.06, meaning we can be 94 percent confident that the difference is not due to chance.
In most cases, researchers would call this “borderline significant” or say there was a “trend” towards an increase in cancer because they would consider something “significant” only if P were less than 0.05. I paid due homage to this phrasing in the caption but such phrases really are unfortunate terms that play into the myth that the significance level is anything more than an arbitrary declaration by fiat that something is or isn’t “significant.”
I’ve sometimes seen it said that being “borderline significant” is like being “borderline pregnant.” Something either is or isn’t significant, according to this view, just as someone either is or isn’t pregnant. But this simile is wrong on two counts: first, the P value is a continuous variable, not a categorical one like pregnancy, so of course something can be close to being significant just like you could come close to being in Utah if you traveled north from Phoenix and stopped a mile from the border; second, since the level of significance is purely arbitrary, we could theoretically set it to whatever we want. If we said that something is significant when P<0.1 then suddenly P = 0.06 would be significant! Like magic!
There is no reason other than prevailing convention that we couldn’t simply set the level of significance at 0.1 instead of 0.05, just like we could set it at a more rigorous level of 0.01. And the prevailing convention is upheld not at the insistence of statisticians — after all, most scientific papers do not have statisticians as co-authors, most scientific papers are not peer-reviewed by statisticians, and journals that publish experimental science often do not have devoted statistical editors — rather, the convention is upheld by force of habit, and is at least occasionally criticized by statisticians. For example, this is what the textbook Statistical Methods in Medical Research (2001; 2008) has to say (p. 88 in the 2008 edition):
The 5% level and, to a lesser extent, the 1% level have become widely accepted as convenient yardsticks for assessing the significance of departures from a null hypothesis. This is unfortunate in a way, because there should be no rigid distinction between a departure which is just beyond the 5% significance level and one which just fails to reach it. It is perhaps preferable to avoid the dichotomy — ‘significant’ or ‘not significant’ — by attempting to measure how significant the departure is.
A convenient way of measuring this is to report the probability, P, of obtaining, if the null hypothesis were true, a sample as extreme as, or more extreme than the sample obtained. One reason for the origin of the use of the dichotomy, significant or not significant, is that significance levels had to be looked up in tables, such as Appendix Tables A2, A3 and A4 [in this book], and this restricted the evaluation of P to a range. Nowadays significance tests are usually carried out by a computer and most statistical computing packages give the calculated P value. It is preferable to quote this value and we shall follow this practice. However, when analyses are carried out by hand, or the calculated P value is not given in computer output, then a range of values could be quoted. This should be done as precisely as possible, particularly when the result is of borderline significance; thus, ‘0.05 < P < 0.1’ is far preferable to ‘not significant (P > 0.05)’.
In my view, the level of confidence we require for making a given decision should depend on the potential impact of that decision. For example, if we want to gain insight into how something works, it might make sense to take a scientific finding very seriously when P<0.1, because that might represent an important lead for future scientific work. That future work could attempt to expand on the earlier finding while also using a larger sample size in an attempt to obtain a lower P value.
If we want to show that a drug reduces the risk of disease, on the other hand, we might want something much stricter, like P<0.01 or P<0.001, because we know from experience that drugs are often more harmful than they first appear in clinical trials, and newer drugs are usually more expensive than old ones, so we don’t want to adopt an expensive and potentially dangerous drug on flimsy evidence for its efficacy. Conversely, for the very same reason we might take a more liberal approach when assessing the potential harm of something new. If a study suggests with P = 0.06 that vegetable oils cause cancer, we should be much more cautious about swigging the soybean oil.
Read more about the author, Chris Masterjohn, PhD, here.
Steve Brecher says
“…P = 0.06. To phrase this in a more intuitively graspable way, we can be 94% confident that this difference was not due to chance.”
This is a common non sequitur. P=0.06 means that if the null hypothesis is true, we’d observe this difference only 6% of the time. But that provides no information on the probability that the null hypothesis is false, i.e., that there is a difference in the entire population.
Actually, I’m not sure what “we can be 94% confident” is intended to mean. I’m assuming it means that there is a 94% probability that this difference is not due to chance, i.e., is due to there being a population difference, i.e., that the null hypothesis is false. But the 0.06 calculation is based on the assumption that the null hypothesis is true.
For more, see Odds are, it’s wrong; scroll down to “BOX 2: The Hunger Hypothesis”.
Chris Masterjohn says
Hi Steve,
Thanks for your comments. I didn’t say anything about the probability that the null hypothesis is false. Whether the null hypothesis is true or false is a fact, not a probability, so there can’t be a probability that the null hypothesis is false.

The “confidence” interpretation is the basis of the “confidence interval.” If you calculated a 94% confidence interval, one of the borders of the interval would be precisely the point that would generate a P of 0.06. If you calculate a 94% confidence interval, and imagine a scenario where you have repeated the construction of this interval from a sample many times, then 94% of the time the interval would include the parameter. Thus we would say that we are 94% confident that the interval includes the parameter, and this is correct. But it would be a fallacy to say the parameter has a 94% chance of being in the interval, because it either is or it isn’t.

In this particular case, we could say that we are 94% confident that the difference in group means is greater than zero because the construction of a 94% confidence interval around the estimated difference would not include zero. Thus, if the P value is 0.06, by implication, the 94% confidence interval just barely excludes a difference of zero, so we are 94% confident there is a difference.
Again, this is not the same thing as there being a 94% chance there is a difference, which is a logical fallacy, and I’m guessing it’s the one you mean to be correcting.
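For concreteness, here is a minimal sketch of that duality in Python, using invented counts (not the study’s data) and the same normal-approximation standard error for both the test and the interval, so the correspondence comes out exactly: the confidence interval whose level matches 1 - P has one end essentially at zero.

```python
import math
from scipy.stats import norm

# Invented counts (cancers, group size); illustration only, not the study's data.
x1, n1 = 30, 200     # vegetable-oil group
x2, n2 = 18, 200     # animal-fat group

p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Two-sided P for the observed difference (normal approximation, unpooled SE).
p_value = 2 * (1 - norm.cdf(abs(diff) / se))

# Build the matching (1 - P) level confidence interval around the difference:
# one of its ends lands essentially at zero, which is the duality in question.
z_crit = norm.ppf(1 - p_value / 2)
lower, upper = diff - z_crit * se, diff + z_crit * se

print(f"P = {p_value:.3f}")
print(f"{(1 - p_value) * 100:.0f}% CI for the difference: ({lower:.4f}, {upper:.4f})")
```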
Chris
Howard says
“well established Bieber Effect.”
Yeah, I had a lot of fun reading Denise’s article. But then, getting something significant from Denise’s writing has a P<0.001 🙂
Chris Masterjohn says
Now that’s a null hypothesis just downright foolish to postulate. 🙂
Chris
LeonRover says
I submit the more useful description for DM’s writing is “non dull”.
Bill Lagakos says
I’d challenge only your claim to no expertise in statistics. Calculations aside, your insight into the philosophy of p-values is, well, insightful. After all, this is biology, not mathematics.
Chris Masterjohn says
Thanks Bill!
Benji says
Straight from page 35 of their main paper:
“Two reasonable alternative explanations for the excess nonatherosclerotic mortality in the late part of the study remain. The first, and probably the most plausible is that it was due to chance alone. Alternatively, one may postulate that in some of these elderly men, many of whom were in precarious condition because of chronic disease, death at a given time was nearly inevitable and that the experimental diet simply modified the manner of exitus.”
The second explanation given by the authors, combined with the fact that absolute cancer mortality was minimal (only 48 out of 352 total deaths), suggests to me that the first explanation (that the difference was due to chance) becomes increasingly attractive.
What do you think Chris?
Chris Masterjohn says
Hi Benji,
On this topic I would recommend reading their paper on cancer rather than reading the main paper. It’s most likely a result of vegetable oil for a few reasons. One is that vegetable oil increases cancer in certain animal models, one is that the P value is so low, one is that the trend increases over time, and one is that there is no comparable and contrary evidence. In other words, there are no other trials of similar duration in a population this advanced in age with this level of control that show otherwise, so what we are left with is considerable reason to think it is a causal effect and almost nothing indicating otherwise. Moreover, there were more smokers in the control group and there was ten times as much vitamin E in the vegetable oil diet, so if anything the study was biased in favor of minimizing the effect, and may have underestimated the true effect of vegetable oils.
Chris
Aaron Blaisdell says
Great post! It’s nice to see periodic discussions of statistics, what they mean, and what they are useful for. I generally think of the calculated p value as telling me the likelihood that replicating the experimental manipulation with a new sample, and perhaps more power (increased n), would yield results that lead one to reject the null hypothesis at the conventional criterion level of significance (typically .05).
In the field of experimental psychology, there has been a lot of debate about whether conventional stats with pre-determined alpha levels of significance are still useful, or if we should instead adopt other measures, such as effect size, or even other types of stats, such as Bayesian statistics. My friend and colleague, John Kruschke at Indiana University, has been championing the replacement of the old-fashioned p-value stats with Bayesian stats.
Chris Masterjohn says
Hey Aaron,
Thanks! Yes, cool ideas. To be honest I don’t know that much about Bayesian stats, except in basic principle, but it seems to me that using them more often in addition to P values and confidence intervals would be better than using them to replace the latter. Mainly because while I think it’s useful to try to quantify the totality of the data, doing so mathematically can never replace rigorous review of the existing evidence. An estimate of the totality would be profoundly affected by the choice of which evidence to include, how to weight it, and so on. So perhaps we can arrive at something where we attempt to incorporate the other data using several different assumptions, and evaluate a study on its own with and without incorporating that data.
Chris
deb says
Hi Chris
Just wanted to say I enjoyed this article ginormously. 🙂 xo deb
Chris Masterjohn says
Hi Deb
Just wanted to say I enjoyed this comment ginormously. 🙂 xo chris
David I says
I’m glad to see this. Most statisticians flinch when they see how their tools are applied in the biosciences. (I’m not a statistician–my undergraduate was physics. But I’ve known a lot of stats folks.)
In addition to the problems with “significance,” another way that the biomed field tends to mislead people is by reporting changes in risk. The simplest form of this, which isn’t wrong but just misleading, is to talk in terms of something doubling the risk of something that is very low-risk to begin with, and remains very low-risk even when doubled. But there are other strange ways of assessing risk that are also quite misleading.
Have you read “Overdiagnosed”? Great book. The author argues rather convincingly that there is very little positive effect from early detection of most cancers. Then why all the stats showing that early detection improves survival?
When you look at it, it’s so simple as to be laughable. The metric is usually five-year survival rates. Let’s say that someone without treatment will die of a cancer at age 85, and that symptoms would normally be detected at age 83.
If early detection allows us to find the cancer six years earlier, at age 77, then even if the person still dies at 85, they become part of the statistics proving that early detection increases five-year survival–because the person is still alive at 82…
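Here is a tiny sketch of that arithmetic, using the ages from the example above and the usual definition of five-year survival:

```python
def five_year_survivor(age_at_diagnosis, age_at_death):
    """True if the patient is still alive five years after diagnosis."""
    return (age_at_death - age_at_diagnosis) > 5

age_at_death = 85
symptomatic_detection = 83   # cancer found when symptoms appear
early_detection = 77         # screening finds the same cancer six years sooner

print(five_year_survivor(symptomatic_detection, age_at_death))  # False
print(five_year_survivor(early_detection, age_at_death))        # True, yet the death is unchanged
```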
Sigh.
Darcy Hemstad says
Thanks for the great article. You can improve your p value for pregnancy by minimizing polyunsaturated fats! 🙂 Darcy Hemstad, Fertility Nurse Consultant
Ned Kock says
It is always worth noting that the P value is sensitive to sample size, and that there are other measures that should also be reported together with it, but rarely are.
A P value may be 0.13 with a sample of 100, but may go down to < 0.001 with a sample of 10,000, for the same coefficient of association. That is, the strength of the effect may be the same, but the associated P value may be much lower if the sample size goes up significantly.
Effect size (reflecting the strength of an effect) is a measure that is rarely reported, but that is NOT affected by sample size at all, and thus should be reported.
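As a quick illustration of that point (Cohen’s h used as the effect size; the rates are invented), the proportions, and hence the effect size, are identical in the small and the large sample, while the P value collapses as n grows:

```python
import math
from scipy.stats import chi2_contingency

def cohens_h(p1, p2):
    """Cohen's h effect size for two proportions; it does not depend on n."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

rate1, rate2 = 0.15, 0.10                # invented rates, identical in both runs
for n in (100, 10_000):                  # per-group sample size
    cases1, cases2 = int(rate1 * n), int(rate2 * n)
    table = [[cases1, n - cases1], [cases2, n - cases2]]
    _, p_value, _, _ = chi2_contingency(table)
    print(f"n = {n:>6}: effect size h = {cohens_h(rate1, rate2):.3f}, P = {p_value:.4g}")
```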
There are different types of effect size statistics. For example, a software tool that I often use, WarpPLS, reports Cohen's f-squared effect size statistic:
http://warppls.com/
Good points Chris, as always.
Chris Wilson says
Excellent discussion Chris. You’ve hit the nail on the head as far as interpreting parametric statistics goes: what p-values and confidence intervals really mean.
I also second Ned’s call for more up-front reporting of effect sizes next to p-values and sample sizes. It’s indispensable to get a good sense of the data.
There is an alternative statistical framework, Bayesian statistics, with a different approach to intervals.
In parametric statistics you assume there is a true parameter value and your constructed interval either does or does not contain it. The confidence in this case, say 95%, means that 95 out of 100 times an interval so defined will contain the true parameter value.
In Bayesian stats, you assume that the underlying parameters are themselves random variables (this seems very correct biologically in many cases, to me). You then construct a “credibility interval”, which expresses the uncertainty of the underlying parameter directly. In other words, it is then correct to say the parameter has a 95% probability of being in *that particular interval*.
The difference may seem like philosophical hair-splitting, but the Bayesian approach to statistics allows you to integrate prior knowledge/data, whereas in parametric statistics, you assume each experiment or data-set is its own universe and should be analyzed in isolation…
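To make the contrast concrete, here is a minimal sketch for a single proportion, with invented data and a flat Beta(1, 1) prior assumed: the credible interval is a direct probability statement about the parameter, while the confidence interval describes the long-run behavior of the procedure.

```python
import math
from scipy.stats import beta, norm

events, n = 18, 200               # invented data: 18 events among 200 subjects
p_hat = events / n

# Frequentist 95% confidence interval (normal approximation): a statement
# about the procedure, not about this particular interval.
z = norm.ppf(0.975)
se = math.sqrt(p_hat * (1 - p_hat) / n)
conf_int = (p_hat - z * se, p_hat + z * se)

# Bayesian 95% credible interval with a flat Beta(1, 1) prior: the posterior
# is Beta(1 + events, 1 + n - events), and the interval is a direct
# probability statement about the parameter itself.
posterior = beta(1 + events, 1 + n - events)
cred_int = (posterior.ppf(0.025), posterior.ppf(0.975))

print(f"95% confidence interval: ({conf_int[0]:.3f}, {conf_int[1]:.3f})")
print(f"95% credible interval:   ({cred_int[0]:.3f}, {cred_int[1]:.3f})")
```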
Cheers!
Chris